Materials data science: descriptors and machine learning¶

Welcome to the materials data science lesson. In this session, we will demonstrate how to use matminer, pandas and scikit-learn for machine learning materials properties.

The lesson is split into four sections: 1. Data retrieval and basic analysis of pandas DataFrame objects. 2. Generating machine learnable descriptors. 3. Training, testing and visualizing machine learning methods with scikit-learn and plotly.express.

Many more tutorials on how to use matminer (beyond the scope of this workshop) are available in the matminer_examples repository, available here.

Machine learning workflow¶

Firstly, what does a typical machine learning workflow look like? The overall process can be summarized as: 1. Take raw inputs, such as a list of compositions, and an associated target property to learn. 2. Convert the raw inputs into descriptors or features that can be learned by machine learning algorithms. 3. Train a machine learning model on the data. 4. Plot and analyze the performance of the model.

machine learning workflow

Typically, questions asked by a new practitioner in the field include: - Where do we get the raw data from? - How do we convert the raw data into learnable features? - How can we plot and interpret the results of a model?

The matminer package has been developed to help make machine learning of materials properties easy and hassle free. The aim of matminer is to connect materials data with data mining algorithms and data visualization.

matminer overview

Part 1: Data retrieval and filtering¶

Matminer interfaces with many materials databases, including: - Materials Project - Citrine - AFLOW - Materials Data Facility (MDF) - Materials Platform for Data Science (MPDS)

In addition, it also includes datasets from published literature. Matminer hosts a repository of 42 (and growing) datasets which comes from published and peer-reviewed machine learning investigations of materials properties or publications of high-throughput computing studies.

In this section, we will show how to access and manipulate the pre-formatted datasets from the published literature. More information on accessing other materials databases are detailed in the matminer_examples repository.

A list of the literature-based datasets can be printed using the get_available_datasets() function. This also prints information about what the dataset contains, such as the number of samples, the target properties, and how the data was obtained (e.g., via theory or experiment).

from matminer.datasets import get_available_datasets

get_available_datasets()

boltztrap_mp: Effective mass and thermoelectric properties of 8924 compounds in The  Materials Project database that are calculated by the BoltzTraP software package run on the GGA-PBE or GGA+U density functional theory calculation results. The properties are reported at the temperature of 300 Kelvin and the carrier concentration of 1e18 1/cm3.

brgoch_superhard_training: 2574 materials used for training regressors that predict shear and bulk modulus.

castelli_perovskites: 18,928 perovskites generated with ABX combinatorics, calculating gllbsc band gap and pbe structure, and also reporting absolute band edge positions and heat of formation.

citrine_thermal_conductivity: Thermal conductivity of 872 compounds measured experimentally and retrieved from Citrine database from various references. The reported values are measured at various temperatures of which 295 are at room temperature.

dielectric_constant: 1,056 structures with dielectric properties, calculated with DFPT-PBE.

double_perovskites_gap: Band gap of 1306 double perovskites (a_1-b_1-a_2-b_2-O6) calculated using Gritsenko, van Leeuwen, van Lenthe and Baerends potential (gllbsc) in GPAW.

double_perovskites_gap_lumo: Supplementary lumo data of 55 atoms for the double_perovskites_gap dataset.

elastic_tensor_2015: 1,181 structures with elastic properties calculated with DFT-PBE.

expt_formation_enthalpy: Experimental formation enthalpies for inorganic compounds, collected from years of calorimetric experiments. There are 1,276 entries in this dataset, mostly binary compounds. Matching mpids or oqmdids as well as the DFT-computed formation energies are also added (if any).

expt_formation_enthalpy_kingsbury: Dataset containing experimental standard formation enthalpies for solids. Formation enthalpies were compiled primarily from Kim et al., Kubaschewski, and the NIST JANAF tables (see references). Elements, liquids, and gases were excluded. Data were deduplicated such that each material is associated with a single formation enthalpy value. Refer to Wang et al. (see references) for a complete desciption of the methods used. Materials Project database IDs (mp-ids) were assigned to materials from among computed materials in the Materials Project database (version 2021.03.22) that were 1) not marked 'theoretical', 2) had structures matching at least one ICSD material, and 3) were within 200 meV of the DFT-computed stable energy hull (e_above_hull < 0.2 eV). Among these candidates, we chose the mp-id with the lowest e_above_hull that matched the reported spacegroup (where available).

expt_gap: Experimental band gap of 6354 inorganic semiconductors.

expt_gap_kingsbury: Identical to the matbench_expt_gap dataset, except that Materials Project database IDs (mp-ids) have been associated with each material using the same method as described for the expt_formation_enthalpy_kingsbury dataset. Columns have also been renamed for consistency with the formation enthalpy data.

flla: 3938 structures and computed formation energies from "Crystal Structure Representations for Machine Learning Models of Formation Energies."

glass_binary: Metallic glass formation data for binary alloys, collected from various experimental techniques such as melt-spinning or mechanical alloying. This dataset covers all compositions with an interval of 5 at. % in 59 binary systems, containing a total of 5959 alloys in the dataset. The target property of this dataset is the glass forming ability (GFA), i.e. whether the composition can form monolithic glass or not, which is either 1 for glass forming or 0 for non-full glass forming.

glass_binary_v2: Identical to glass_binary dataset, but with duplicate entries merged. If there was a disagreement in gfa when merging the class was defaulted to 1.

glass_ternary_hipt: Metallic glass formation dataset for ternary alloys, collected from the high-throughput sputtering experiments measuring whether it is possible to form a glass using sputtering. The hipt experimental data are of the Co-Fe-Zr, Co-Ti-Zr, Co-V-Zr and Fe-Ti-Nb ternary systems.

glass_ternary_landolt: Metallic glass formation dataset for ternary alloys, collected from the "Nonequilibrium Phase Diagrams of Ternary Amorphous Alloys,’ a volume of the Landolt– Börnstein collection. This dataset contains experimental measurements of whether it is possible to form a glass using a variety of processing techniques at thousands of compositions from hundreds of ternary systems. The processing techniques are designated in the "processing" column. There are originally 7191 experiments in this dataset, will be reduced to 6203 after deduplicated, and will be further reduced to 6118 if combining multiple data for one composition. There are originally 6780 melt-spinning experiments in this dataset, will be reduced to 5800 if deduplicated, and will be further reduced to 5736 if combining multiple experimental data for one composition.

heusler_magnetic: 1153 Heusler alloys with DFT-calculated magnetic and electronic properties. The 1153 alloys include 576 full, 449 half and 128 inverse Heusler alloys. The data are extracted and cleaned (including de-duplicating) from Citrine.

jarvis_dft_2d: Various properties of 636 2D materials computed with the OptB88vdW and TBmBJ functionals taken from the JARVIS DFT database.

jarvis_dft_3d: Various properties of 25,923 bulk materials computed with the OptB88vdW and TBmBJ functionals taken from the JARVIS DFT database.

jarvis_ml_dft_training: Various properties of 24,759 bulk and 2D materials computed with the OptB88vdW and TBmBJ functionals taken from the JARVIS DFT database.

m2ax: Elastic properties of 223 stable M2AX compounds from "A comprehensive survey of M2AX phase elastic properties" by Cover et al. Calculations are PAW PW91.

matbench_dielectric: Matbench v0.1 test dataset for predicting refractive index from structure. Adapted from Materials Project database. Removed entries having a formation energy (or energy above the convex hull) more than 150meV and those having refractive indices less than 1 and those containing noble gases. Retrieved April 2, 2019. For benchmarking w/ nested cross validation, the order of the dataset must be identical to the retrieved data; refer to the Automatminer/Matbench publication for more details.

matbench_expt_gap: Matbench v0.1 test dataset for predicting experimental band gap from composition alone. Retrieved from Zhuo et al. supplementary information. Deduplicated according to composition, removing compositions with reported band gaps spanning more than a 0.1eV range; remaining compositions were assigned values based on the closest experimental value to the mean experimental value for that composition among all reports. For benchmarking w/ nested cross validation, the order of the dataset must be identical to the retrieved data; refer to the Automatminer/Matbench publication for more details.

matbench_expt_is_metal: Matbench v0.1 test dataset for classifying metallicity from composition alone. Retrieved from Zhuo et al. supplementary information. Deduplicated according to composition, ensuring no conflicting reports were entered for any compositions (i.e., no reported compositions were both metal and nonmetal). For benchmarking w/ nested cross validation, the order of the dataset must be identical to the retrieved data; refer to the Automatminer/Matbench publication for more details.

matbench_glass: Matbench v0.1 test dataset for predicting full bulk metallic glass formation ability from chemical formula. Retrieved from "Nonequilibrium Phase Diagrams of Ternary Amorphous Alloys,’ a volume of the Landolt– Börnstein collection. Deduplicated according to composition, ensuring no compositions were reported as both GFA and not GFA (i.e., all reports agreed on the classification designation). For benchmarking w/ nested cross validation, the order of the dataset must be identical to the retrieved data; refer to the Automatminer/Matbench publication for more details.

matbench_jdft2d: Matbench v0.1 test dataset for predicting exfoliation energies from crystal structure (computed with the OptB88vdW and TBmBJ functionals). Adapted from the JARVIS DFT database. For benchmarking w/ nested cross validation, the order of the dataset must be identical to the retrieved data; refer to the Automatminer/Matbench publication for more details.

matbench_log_gvrh: Matbench v0.1 test dataset for predicting DFT log10 VRH-average shear modulus from structure. Adapted from Materials Project database. Removed entries having a formation energy (or energy above the convex hull) more than 150meV and those having negative G_Voigt, G_Reuss, G_VRH, K_Voigt, K_Reuss, or K_VRH and those failing G_Reuss <= G_VRH <= G_Voigt or K_Reuss <= K_VRH <= K_Voigt and those containing noble gases. Retrieved April 2, 2019. For benchmarking w/ nested cross validation, the order of the dataset must be identical to the retrieved data; refer to the Automatminer/Matbench publication for more details.

matbench_log_kvrh: Matbench v0.1 test dataset for predicting DFT log10 VRH-average bulk modulus from structure. Adapted from Materials Project database. Removed entries having a formation energy (or energy above the convex hull) more than 150meV and those having negative G_Voigt, G_Reuss, G_VRH, K_Voigt, K_Reuss, or K_VRH and those failing G_Reuss <= G_VRH <= G_Voigt or K_Reuss <= K_VRH <= K_Voigt and those containing noble gases. Retrieved April 2, 2019. For benchmarking w/ nested cross validation, the order of the dataset must be identical to the retrieved data; refer to the Automatminer/Matbench publication for more details.

matbench_mp_e_form: Matbench v0.1 test dataset for predicting DFT formation energy from structure. Adapted from Materials Project database. Removed entries having formation energy more than 2.5eV and those containing noble gases. Retrieved April 2, 2019. For benchmarking w/ nested cross validation, the order of the dataset must be identical to the retrieved data; refer to the Automatminer/Matbench publication for more details.

matbench_mp_gap: Matbench v0.1 test dataset for predicting DFT PBE band gap from structure. Adapted from Materials Project database. Removed entries having a formation energy (or energy above the convex hull) more than 150meV and those containing noble gases. Retrieved April 2, 2019. For benchmarking w/ nested cross validation, the order of the dataset must be identical to the retrieved data; refer to the Automatminer/Matbench publication for more details.

matbench_mp_is_metal: Matbench v0.1 test dataset for predicting DFT metallicity from structure. Adapted from Materials Project database. Removed entries having a formation energy (or energy above the convex hull) more than 150meV and those containing noble gases. Retrieved April 2, 2019. For benchmarking w/ nested cross validation, the order of the dataset must be identical to the retrieved data; refer to the Automatminer/Matbench publication for more details.

matbench_perovskites: Matbench v0.1 test dataset for predicting formation energy from crystal structure. Adapted from an original dataset generated by Castelli et al. For benchmarking w/ nested cross validation, the order of the dataset must be identical to the retrieved data; refer to the Automatminer/Matbench publication for more details.

matbench_phonons: Matbench v0.1 test dataset for predicting vibration properties from crystal structure. Original data retrieved from Petretto et al. Original calculations done via ABINIT in the harmonic approximation based on density functional perturbation theory. Removed entries having a formation energy (or energy above the convex hull) more than 150meV. For benchmarking w/ nested cross validation, the order of the dataset must be identical to the retrieved data; refer to the Automatminer/Matbench publication for more details.

matbench_steels: Matbench v0.1 test dataset for predicting steel yield strengths from chemical composition alone. Retrieved from Citrine informatics. Deduplicated. For benchmarking w/ nested cross validation, the order of the dataset must be identical to the retrieved data; refer to the Automatminer/Matbench publication for more details.

mp_all_20181018: A complete copy of the Materials Project database as of 10/18/2018. mp_all files contain structure data for each material while mp_nostruct does not.

mp_nostruct_20181018: A complete copy of the Materials Project database as of 10/18/2018. mp_all files contain structure data for each material while mp_nostruct does not.

phonon_dielectric_mp: Phonon (lattice/atoms vibrations) and dielectric properties of 1296 compounds computed via ABINIT software package in the harmonic approximation based on density functional perturbation theory.

piezoelectric_tensor: 941 structures with piezoelectric properties, calculated with DFT-PBE.

ricci_boltztrap_mp_tabular: Ab-initio electronic transport database for inorganic materials. Complex multivariable BoltzTraP simulation data is condensed down into tabular form of two main motifs: average eigenvalues at set moderate carrier concentrations and temperatures, and optimal values among all carrier concentrations and temperatures within certain ranges. Here are reported the average of the eigenvalues of conductivity effective mass (mₑᶜᵒⁿᵈ), the Seebeck coefficient (S), the conductivity (σ), the electronic thermal conductivity (κₑ), and the Power Factor (PF) at a doping level of 10¹⁸ cm⁻³ and at a temperature of 300 K for n- and p-type. Also, the maximum values for S, σ, PF, and the minimum value for κₑ chosen among the temperatures [100, 1300] K, the doping levels [10¹⁶, 10²¹] cm⁻³, and doping types are reported. The properties that depend on the relaxation time are reported divided by the constant value 10⁻¹⁴. The average of the eigenvalues for all the properties at all the temperatures, doping levels, and doping types are reported in the tables for each entry. Data is indexed by materials project id (mpid)

steel_strength: 312 steels with experimental yield strength and ultimate tensile strength, extracted and cleaned (including de-duplicating) from Citrine.

wolverton_oxides: 4,914 perovskite oxides containing composition data, lattice constants, and formation + vacancy formation energies. All perovskites are of the form ABO3. Adapted from a dataset presented by Emery and Wolverton.

['boltztrap_mp',
 'brgoch_superhard_training',
 'castelli_perovskites',
 'citrine_thermal_conductivity',
 'dielectric_constant',
 'double_perovskites_gap',
 'double_perovskites_gap_lumo',
 'elastic_tensor_2015',
 'expt_formation_enthalpy',
 'expt_formation_enthalpy_kingsbury',
 'expt_gap',
 'expt_gap_kingsbury',
 'flla',
 'glass_binary',
 'glass_binary_v2',
 'glass_ternary_hipt',
 'glass_ternary_landolt',
 'heusler_magnetic',
 'jarvis_dft_2d',
 'jarvis_dft_3d',
 'jarvis_ml_dft_training',
 'm2ax',
 'matbench_dielectric',
 'matbench_expt_gap',
 'matbench_expt_is_metal',
 'matbench_glass',
 'matbench_jdft2d',
 'matbench_log_gvrh',
 'matbench_log_kvrh',
 'matbench_mp_e_form',
 'matbench_mp_gap',
 'matbench_mp_is_metal',
 'matbench_perovskites',
 'matbench_phonons',
 'matbench_steels',
 'mp_all_20181018',
 'mp_nostruct_20181018',
 'phonon_dielectric_mp',
 'piezoelectric_tensor',
 'ricci_boltztrap_mp_tabular',
 'steel_strength',
 'wolverton_oxides']

All datasets can be loaded using the load_dataset() function and the database name. To save installation space, the datasets are not automatically downloaded when matminer is installed. Instead, the first time the dataset is loaded, it will be downloaded from the internet and stored in the matminer installation directory.

Let's say we're interested in the dielectric_constant dataset, which contains 1,056 structures with dielectric properties calculated with DFPT-PBE. We can download it with the load_dataset function.

We'll set an environment variable MATMINER_DATA which will tell matminer to download all our dataset to a directory ./data. If you are running this locally, you usually don't need to set this variable as matminer will download the dataset directly to your matminer source code folder.

%env MATMINER_DATA data

from matminer.datasets import load_dataset

df = load_dataset("dielectric_constant")

env: MATMINER_DATA=data
Making dataset storage folder at data
Fetching dielectric_constant.json.gz from https://ndownloader.figshare.com/files/13213475 to data/dielectric_constant.json.gz

Fetching https://ndownloader.figshare.com/files/13213475 in MB: 0.8867839999999999MB [00:00, 19.17MB/s]

We can get some more detailed information about this dataset using the get_all_dataset_info(<dataset>) function from matminer.

from matminer.datasets import get_all_dataset_info

print(get_all_dataset_info("dielectric_constant"))

Dataset: dielectric_constant
Description: 1,056 structures with dielectric properties, calculated with DFPT-PBE.
Columns:
    band_gap: Measure of the conductivity of a material
    cif: optional: Description string for structure
    e_electronic: electronic contribution to dielectric tensor
    e_total: Total dielectric tensor incorporating both electronic and ionic contributions
    formula: Chemical formula of the material
    material_id: Materials Project ID of the material
    meta: optional, metadata descriptor of the datapoint
    n: Refractive Index
    nsites: The \# of atoms in the unit cell of the calculation.
    poly_electronic: the average of the eigenvalues of the electronic contribution to the dielectric tensor
    poly_total: the average of the eigenvalues of the total (electronic and ionic) contributions to the dielectric tensor
    poscar: optional: Poscar metadata
    pot_ferroelectric: Whether the material is potentially ferroelectric
    space_group: Integer specifying the crystallographic structure of the material
    structure: pandas Series defining the structure of the material
    volume: Volume of the unit cell in cubic angstroms, For supercell calculations, this quantity refers to the volume of the full supercell. 
Num Entries: 1056
Reference: Petousis, I., Mrdjenovich, D., Ballouz, E., Liu, M., Winston, D.,
Chen, W., Graf, T., Schladt, T. D., Persson, K. A. & Prinz, F. B.
High-throughput screening of inorganic compounds for the discovery
of novel dielectric and optical materials. Sci. Data 4, 160134 (2017).
Bibtex citations: ['@Article{Petousis2017,\nauthor={Petousis, Ioannis and Mrdjenovich, David and Ballouz, Eric\nand Liu, Miao and Winston, Donald and Chen, Wei and Graf, Tanja\nand Schladt, Thomas D. and Persson, Kristin A. and Prinz, Fritz B.},\ntitle={High-throughput screening of inorganic compounds for the\ndiscovery of novel dielectric and optical materials},\njournal={Scientific Data},\nyear={2017},\nmonth={Jan},\nday={31},\npublisher={The Author(s)},\nvolume={4},\npages={160134},\nnote={Data Descriptor},\nurl={http://dx.doi.org/10.1038/sdata.2016.134}\n}']
File type: json.gz
Figshare URL: https://ndownloader.figshare.com/files/13213475
SHA256 Hash Digest: 8eb24812148732786cd7c657eccfc6b5ee66533429c2cfbcc4f0059c0295e8b6

Manipulating and examining pandas `DataFrame` objects¶

The datasets are made available as pandas DataFrame objects. You can think of these as a type of "spreadsheet" object in Python. DataFrames have several useful methods you can use to explore and clean the data, some of which we'll explore below.

Dataframes are central objects for retrieving and holding data, adding features, training models, and visualizing with graphs in Python. In the rest of the lesson we will be working with them a lot.

Inspecting the dataset¶

The head() function prints a summary of the first few rows of a data set. You can scroll across to see more columns. From this, it is easy to see the types of data available in in the dataset.

df.head()

	material_id	formula	nsites	space_group	volume	structure	band_gap	e_electronic	e_total	n	poly_electronic	poly_total	pot_ferroelectric	cif	meta	poscar
0	mp-441	Rb2Te	3	225	159.501208	[[1.75725875 1.2425695 3.04366125] Rb, [5.271...	1.88	[[3.44115795, -3.097e-05, -6.276e-05], [-2.837...	[[6.23414745, -0.00035252, -9.796e-05], [-0.00...	1.86	3.44	6.23	False	#\#CIF1.1\n###################################...	{u'incar': u'NELM = 100\nIBRION = 8\nLWAVE = F...	Rb2 Te1\n1.0\n5.271776 0.000000 3.043661\n1.75...
1	mp-22881	CdCl2	3	166	84.298097	[[0. 0. 0.] Cd, [ 4.27210959 2.64061969 13.13...	3.52	[[3.34688382, -0.04498543, -0.22379197], [-0.0...	[[7.97018673, -0.29423886, -1.463590159999999]...	1.78	3.16	6.73	False	#\#CIF1.1\n###################################...	{u'incar': u'NELM = 100\nIBRION = 8\nLWAVE = F...	Cd1 Cl2\n1.0\n3.850977 0.072671 5.494462\n1.78...
2	mp-28013	MnI2	3	164	108.335875	[[0. 0. 0.] Mn, [-2.07904300e-06 2.40067320e+...	1.17	[[5.5430849, -5.28e-06, -2.5030000000000003e-0...	[[13.80606079, 0.0006911900000000001, 9.655e-0...	2.23	4.97	10.64	False	#\#CIF1.1\n###################################...	{u'incar': u'NELM = 100\nIBRION = 8\nLWAVE = F...	Mn1 I2\n1.0\n4.158086 0.000000 0.000000\n-2.07...
3	mp-567290	LaN	4	186	88.162562	[[-1.73309900e-06 2.38611186e+00 5.95256328e...	1.12	[[7.09316738, 7.99e-06, -0.0003864700000000000...	[[16.79535386, 8.199999999999997e-07, -0.00948...	2.65	7.04	17.99	False	#\#CIF1.1\n###################################...	{u'incar': u'NELM = 100\nIBRION = 8\nLWAVE = F...	La2 N2\n1.0\n4.132865 0.000000 0.000000\n-2.06...
4	mp-560902	MnF2	6	136	82.826401	[[1.677294 2.484476 2.484476] Mn, [0. 0. 0.] M...	2.87	[[2.4239622, 7.452000000000001e-05, 6.06100000...	[[6.44055613, 0.0020446600000000002, 0.0013203...	1.53	2.35	7.12	False	#\#CIF1.1\n###################################...	{u'incar': u'NELM = 100\nIBRION = 8\nLDAUTYPE ...	Mn2 F4\n1.0\n3.354588 0.000000 0.000000\n0.000...

Sometimes, if a dataset is very large, you will be unable to see all the available columns. Instead, you can see the full list of columns using the columns attribute:

df.columns

Index(['material_id', 'formula', 'nsites', 'space_group', 'volume',
       'structure', 'band_gap', 'e_electronic', 'e_total', 'n',
       'poly_electronic', 'poly_total', 'pot_ferroelectric', 'cif', 'meta',
       'poscar'],
      dtype='object')

A pandas DataFrame includes a function called describe() that helps determine statistics for the various numerical/categorical columns in the data. Note that the describe() function only describes numerical columns by default.

Sometimes, the describe() function will reveal outliers that indicate mistakes in the data.

df.describe()

	nsites	space_group	volume	band_gap	n	poly_electronic	poly_total
count	1056.000000	1056.000000	1056.000000	1056.000000	1056.000000	1056.000000	1056.000000
mean	7.530303	142.970644	166.420376	2.119432	2.434886	7.248049	14.777898
std	3.388443	67.264591	97.425084	1.604924	1.148849	13.054947	19.435303
min	2.000000	1.000000	13.980548	0.110000	1.280000	1.630000	2.080000
25%	5.000000	82.000000	96.262337	0.890000	1.770000	3.130000	7.557500
50%	8.000000	163.000000	145.944691	1.730000	2.190000	4.790000	10.540000
75%	9.000000	194.000000	212.106405	2.885000	2.730000	7.440000	15.482500
max	20.000000	229.000000	597.341134	8.320000	16.030000	256.840000	277.780000

Indexing the dataset¶

We can access a particular column of DataFrame by indexing the object using the column name. For example:

df["band_gap"]

0       1.88
1       3.52
2       1.17
3       1.12
4       2.87
        ... 
1051    0.87
1052    3.60
1053    0.14
1054    0.21
1055    0.26
Name: band_gap, Length: 1056, dtype: float64

You can also access multiple columns by indexing with a list of column names rather than a single column name:

Alternatively, we can access a particular row of a Dataframe using the iloc attribute.

df.iloc[100]

material_id                                                    mp-7140
formula                                                            SiC
nsites                                                               4
space_group                                                        186
volume                                                       42.005504
structure            [[-1.87933700e-06  1.78517223e+00  2.53458835e...
band_gap                                                           2.3
e_electronic         [[6.9589498, -3.29e-06, 0.0014472600000000001]...
e_total              [[10.193825310000001, -3.7090000000000006e-05,...
n                                                                 2.66
poly_electronic                                                   7.08
poly_total                                                       10.58
pot_ferroelectric                                                False
cif                  #\#CIF1.1\n###################################...
meta                 {u'incar': u'NELM = 100\nIBRION = 8\nLWAVE = F...
poscar               Si2 C2\n1.0\n3.092007 0.000000 0.000000\n-1.54...
Name: 100, dtype: object

Filtering the dataset¶

Pandas DataFrame objects make it very easy to filter the data based on a specific column. We can use the typical Python comparison operators (==, >, >=, <, etc) to filter numerical values. For example, let's find all entries where the cell volume is greater than 580. We do this by filtering on the volume column.

Note that we first produce a boolean mask – a series of True and False depending on the comparison. We can then use the mask to filter the DataFrame.

mask = df["volume"] >= 580
df[mask]

	material_id	formula	nsites	space_group	volume	structure	band_gap	e_electronic	e_total	n	poly_electronic	poly_total	pot_ferroelectric	cif	meta	poscar
206	mp-23280	AsCl3	16	19	582.085309	[[0.13113333 7.14863883 9.63476955] As, [2.457...	3.99	[[2.2839161900000002, 0.00014519, -2.238000000...	[[2.49739759, 0.00069379, 0.00075864], [0.0004...	1.57	2.47	3.30	False	#\#CIF1.1\n###################################...	{u'incar': u'NELM = 100\nIBRION = 8\nLWAVE = F...	As4 Cl12\n1.0\n4.652758 0.000000 0.000000\n0.0...
216	mp-9064	RbTe	12	189	590.136085	[[6.61780282 0. 0. ] Rb, [1.750...	0.43	[[3.25648277, 5.9650000000000007e-05, 1.57e-06...	[[5.34517928, 0.00022474000000000002, -0.00018...	2.05	4.20	6.77	False	#\#CIF1.1\n###################################...	{u'incar': u'NELM = 100\nIBRION = 8\nLWAVE = F...	Rb6 Te6\n1.0\n10.118717 0.000000 0.000000\n-5....
219	mp-23230	PCl3	16	62	590.637274	[[6.02561815 8.74038483 7.55586375] P, [2.7640...	4.03	[[2.39067769, 0.00017593, 8.931000000000001e-0...	[[2.80467218, 0.00034093000000000003, 0.000692...	1.52	2.31	2.76	False	#\#CIF1.1\n###################################...	{u'incar': u'NELM = 100\nIBRION = 8\nLWAVE = F...	P4 Cl12\n1.0\n6.523152 0.000000 0.000000\n0.00...
251	mp-2160	Sb2Se3	20	62	597.341134	[[3.02245275 0.42059268 1.7670481 ] Sb, [ 1.00...	0.76	[[19.1521058, 5.5e-06, 0.00025268], [-1.078000...	[[81.93819038000001, 0.0006755800000000001, 0....	3.97	15.76	63.53	True	#\#CIF1.1\n###################################...	{u'incar': u'NELM = 100\nIBRION = 8\nLWAVE = F...	Sb8 Se12\n1.0\n4.029937 0.000000 0.000000\n0.0...

We can use this method of filtering to clean our dataset. For example, if we only wanted our dataset to only include nonmetals (materials with a non-zero band gap), we can do this easily by filtering the band_gap column.

mask = df["band_gap"] > 0
nonmetal_df = df[mask]
nonmetal_df

	material_id	formula	nsites	space_group	volume	structure	band_gap	e_electronic	e_total	n	poly_electronic	poly_total	pot_ferroelectric	cif	meta	poscar
0	mp-441	Rb2Te	3	225	159.501208	[[1.75725875 1.2425695 3.04366125] Rb, [5.271...	1.88	[[3.44115795, -3.097e-05, -6.276e-05], [-2.837...	[[6.23414745, -0.00035252, -9.796e-05], [-0.00...	1.86	3.44	6.23	False	#\#CIF1.1\n###################################...	{u'incar': u'NELM = 100\nIBRION = 8\nLWAVE = F...	Rb2 Te1\n1.0\n5.271776 0.000000 3.043661\n1.75...
1	mp-22881	CdCl2	3	166	84.298097	[[0. 0. 0.] Cd, [ 4.27210959 2.64061969 13.13...	3.52	[[3.34688382, -0.04498543, -0.22379197], [-0.0...	[[7.97018673, -0.29423886, -1.463590159999999]...	1.78	3.16	6.73	False	#\#CIF1.1\n###################################...	{u'incar': u'NELM = 100\nIBRION = 8\nLWAVE = F...	Cd1 Cl2\n1.0\n3.850977 0.072671 5.494462\n1.78...
2	mp-28013	MnI2	3	164	108.335875	[[0. 0. 0.] Mn, [-2.07904300e-06 2.40067320e+...	1.17	[[5.5430849, -5.28e-06, -2.5030000000000003e-0...	[[13.80606079, 0.0006911900000000001, 9.655e-0...	2.23	4.97	10.64	False	#\#CIF1.1\n###################################...	{u'incar': u'NELM = 100\nIBRION = 8\nLWAVE = F...	Mn1 I2\n1.0\n4.158086 0.000000 0.000000\n-2.07...
3	mp-567290	LaN	4	186	88.162562	[[-1.73309900e-06 2.38611186e+00 5.95256328e...	1.12	[[7.09316738, 7.99e-06, -0.0003864700000000000...	[[16.79535386, 8.199999999999997e-07, -0.00948...	2.65	7.04	17.99	False	#\#CIF1.1\n###################################...	{u'incar': u'NELM = 100\nIBRION = 8\nLWAVE = F...	La2 N2\n1.0\n4.132865 0.000000 0.000000\n-2.06...
4	mp-560902	MnF2	6	136	82.826401	[[1.677294 2.484476 2.484476] Mn, [0. 0. 0.] M...	2.87	[[2.4239622, 7.452000000000001e-05, 6.06100000...	[[6.44055613, 0.0020446600000000002, 0.0013203...	1.53	2.35	7.12	False	#\#CIF1.1\n###################################...	{u'incar': u'NELM = 100\nIBRION = 8\nLDAUTYPE ...	Mn2 F4\n1.0\n3.354588 0.000000 0.000000\n0.000...
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1051	mp-568032	Cd(InSe2)2	7	111	212.493121	[[0. 0. 0.] Cd, [2.9560375 0. 3.03973 ...	0.87	[[7.74896783, 0.0, 0.0], [0.0, 7.74896783, 0.0...	[[11.85159471, 1e-08, 0.0], [1e-08, 11.8515962...	2.77	7.67	11.76	True	#\#CIF1.1\n###################################...	{u'incar': u'NELM = 100\nIBRION = 8\nLWAVE = F...	Cd1 In2 Se4\n1.0\n5.912075 0.000000 0.000000\n...
1052	mp-696944	LaHBr2	8	194	220.041363	[[2.068917 3.58317965 3.70992025] La, [4.400...	3.60	[[4.40504391, 6.1e-07, 0.0], [6.1e-07, 4.40501...	[[8.77136355, 1.649999999999999e-06, 0.0], [1....	2.00	3.99	7.08	True	#\#CIF1.1\n###################################...	{u'incar': u'NELM = 100\nIBRION = 8\nLWAVE = F...	La2 H2 Br4\n1.0\n4.137833 0.000000 0.000000\n-...
1053	mp-16238	Li2AgSb	4	216	73.882306	[[1.35965225 0.96141925 2.354987 ] Li, [2.719...	0.14	[[212.60750153, -1.843e-05, 0.0], [-1.843e-05,...	[[232.59707383, -0.0005407400000000001, 0.0025...	14.58	212.61	232.60	True	#\#CIF1.1\n###################################...	{u'incar': u'NELM = 100\nIBRION = 8\nLWAVE = F...	Li2 Ag1 Sb1\n1.0\n4.078957 0.000000 2.354987\n...
1054	mp-4405	Rb3AuO	5	221	177.269065	[[0. 2.808758 2.808758] Rb, [2.808758 2....	0.21	[[6.40511712, 0.0, 0.0], [0.0, 6.40511712, 0.0...	[[22.43799785, 0.0, 0.0], [0.0, 22.4380185, 0....	2.53	6.41	22.44	True	#\#CIF1.1\n###################################...	{u'incar': u'NELM = 100\nIBRION = 8\nLWAVE = F...	Rb3 Au1 O1\n1.0\n5.617516 0.000000 0.000000\n0...
1055	mp-3486	KSnSb	6	186	227.725015	[[-1.89006800e-06 2.56736395e+00 1.32914373e...	0.26	[[13.85634957, 1.8e-06, 0.0], [1.8e-06, 13.856...	[[16.45311887, 1.64e-06, -0.00019139], [1.64e-...	3.53	12.47	15.55	True	#\#CIF1.1\n###################################...	{u'incar': u'NELM = 100\nIBRION = 8\nLWAVE = F...	K2 Sn2 Sb2\n1.0\n4.446803 0.000000 0.000000\n-...

1056 rows × 16 columns

Often, a dataset contains many additional columns that are not necessary for machine learning. Before we can train a model on the data, we need to remove any extraneous columns. We can remove whole columns from the dataset using the drop() function. This function can be used to drop both rows and columns.

The function takes a list of items to drop. For columns, this is column names whereas for rows it is the row number. Finally, the axis option specifies whether the data to drop is columns (1) or rows (0).

For example, to remove the nsites, space_group, e_electronic, and e_total columns, we can run:

cleaned_df = df.drop(["nsites", "space_group", "e_electronic", "e_total"], axis=1)

Let's examine the cleaned DataFrame to see that the columns have been removed.

cleaned_df.head()

	material_id	formula	volume	structure	band_gap	n	poly_electronic	poly_total	pot_ferroelectric	cif	meta	poscar
0	mp-441	Rb2Te	159.501208	[[1.75725875 1.2425695 3.04366125] Rb, [5.271...	1.88	1.86	3.44	6.23	False	#\#CIF1.1\n###################################...	{u'incar': u'NELM = 100\nIBRION = 8\nLWAVE = F...	Rb2 Te1\n1.0\n5.271776 0.000000 3.043661\n1.75...
1	mp-22881	CdCl2	84.298097	[[0. 0. 0.] Cd, [ 4.27210959 2.64061969 13.13...	3.52	1.78	3.16	6.73	False	#\#CIF1.1\n###################################...	{u'incar': u'NELM = 100\nIBRION = 8\nLWAVE = F...	Cd1 Cl2\n1.0\n3.850977 0.072671 5.494462\n1.78...
2	mp-28013	MnI2	108.335875	[[0. 0. 0.] Mn, [-2.07904300e-06 2.40067320e+...	1.17	2.23	4.97	10.64	False	#\#CIF1.1\n###################################...	{u'incar': u'NELM = 100\nIBRION = 8\nLWAVE = F...	Mn1 I2\n1.0\n4.158086 0.000000 0.000000\n-2.07...
3	mp-567290	LaN	88.162562	[[-1.73309900e-06 2.38611186e+00 5.95256328e...	1.12	2.65	7.04	17.99	False	#\#CIF1.1\n###################################...	{u'incar': u'NELM = 100\nIBRION = 8\nLWAVE = F...	La2 N2\n1.0\n4.132865 0.000000 0.000000\n-2.06...
4	mp-560902	MnF2	82.826401	[[1.677294 2.484476 2.484476] Mn, [0. 0. 0.] M...	2.87	1.53	2.35	7.12	False	#\#CIF1.1\n###################################...	{u'incar': u'NELM = 100\nIBRION = 8\nLDAUTYPE ...	Mn2 F4\n1.0\n3.354588 0.000000 0.000000\n0.000...

You can alternatively select multiple columns by passing in a list of column names as an index.

For example, if we're only interested in the band_gap and structure columns, we can index with ["band_gap", "structure"]

df[["band_gap", "structure"]]

	band_gap	structure
0	1.88	[[1.75725875 1.2425695 3.04366125] Rb, [5.271...
1	3.52	[[0. 0. 0.] Cd, [ 4.27210959 2.64061969 13.13...
2	1.17	[[0. 0. 0.] Mn, [-2.07904300e-06 2.40067320e+...
3	1.12	[[-1.73309900e-06 2.38611186e+00 5.95256328e...
4	2.87	[[1.677294 2.484476 2.484476] Mn, [0. 0. 0.] M...
...	...	...
1051	0.87	[[0. 0. 0.] Cd, [2.9560375 0. 3.03973 ...
1052	3.60	[[2.068917 3.58317965 3.70992025] La, [4.400...
1053	0.14	[[1.35965225 0.96141925 2.354987 ] Li, [2.719...
1054	0.21	[[0. 2.808758 2.808758] Rb, [2.808758 2....
1055	0.26	[[-1.89006800e-06 2.56736395e+00 1.32914373e...

1056 rows × 2 columns

Generating new columns¶

Pandas DataFrame objects also make it easy to perform simple calculations on the data. Think of this as using formulas in Excel spreadsheets. All fundamental Python math operators (such as +, -, /, and *) can be used.

For example, the dielectric dataset contains the electronic contribution to the dielectric constant (\(\epsilon_\mathrm{electronic}\), in the poly_electronic column) and the total (static) dielectric constant (\(\epsilon_\mathrm{total}\), in the poly_total column). The ionic contribution to the dataset is given by:

\[ \epsilon_\mathrm{ionic} = \epsilon_\mathrm{total} - \epsilon_\mathrm{electronic} \]

Below, we calculate the ionic contribution to the dielectric constant and store it in a new column called poly_ionic. This is as simple as assigning the data to the new column, even if the column doesn't already exist.

df["poly_ionic"] = df["poly_total"] - df["poly_electronic"]

Let's check the new data was added correctly.

df.head()

	material_id	formula	nsites	space_group	volume	structure	band_gap	e_electronic	e_total	n	poly_electronic	poly_total	pot_ferroelectric	cif	meta	poscar	poly_ionic
0	mp-441	Rb2Te	3	225	159.501208	[[1.75725875 1.2425695 3.04366125] Rb, [5.271...	1.88	[[3.44115795, -3.097e-05, -6.276e-05], [-2.837...	[[6.23414745, -0.00035252, -9.796e-05], [-0.00...	1.86	3.44	6.23	False	#\#CIF1.1\n###################################...	{u'incar': u'NELM = 100\nIBRION = 8\nLWAVE = F...	Rb2 Te1\n1.0\n5.271776 0.000000 3.043661\n1.75...	2.79
1	mp-22881	CdCl2	3	166	84.298097	[[0. 0. 0.] Cd, [ 4.27210959 2.64061969 13.13...	3.52	[[3.34688382, -0.04498543, -0.22379197], [-0.0...	[[7.97018673, -0.29423886, -1.463590159999999]...	1.78	3.16	6.73	False	#\#CIF1.1\n###################################...	{u'incar': u'NELM = 100\nIBRION = 8\nLWAVE = F...	Cd1 Cl2\n1.0\n3.850977 0.072671 5.494462\n1.78...	3.57
2	mp-28013	MnI2	3	164	108.335875	[[0. 0. 0.] Mn, [-2.07904300e-06 2.40067320e+...	1.17	[[5.5430849, -5.28e-06, -2.5030000000000003e-0...	[[13.80606079, 0.0006911900000000001, 9.655e-0...	2.23	4.97	10.64	False	#\#CIF1.1\n###################################...	{u'incar': u'NELM = 100\nIBRION = 8\nLWAVE = F...	Mn1 I2\n1.0\n4.158086 0.000000 0.000000\n-2.07...	5.67
3	mp-567290	LaN	4	186	88.162562	[[-1.73309900e-06 2.38611186e+00 5.95256328e...	1.12	[[7.09316738, 7.99e-06, -0.0003864700000000000...	[[16.79535386, 8.199999999999997e-07, -0.00948...	2.65	7.04	17.99	False	#\#CIF1.1\n###################################...	{u'incar': u'NELM = 100\nIBRION = 8\nLWAVE = F...	La2 N2\n1.0\n4.132865 0.000000 0.000000\n-2.06...	10.95
4	mp-560902	MnF2	6	136	82.826401	[[1.677294 2.484476 2.484476] Mn, [0. 0. 0.] M...	2.87	[[2.4239622, 7.452000000000001e-05, 6.06100000...	[[6.44055613, 0.0020446600000000002, 0.0013203...	1.53	2.35	7.12	False	#\#CIF1.1\n###################################...	{u'incar': u'NELM = 100\nIBRION = 8\nLDAUTYPE ...	Mn2 F4\n1.0\n3.354588 0.000000 0.000000\n0.000...	4.77

Part 2: Generating descriptors for machine learning¶

In this section, we will learn a bit about how to generate machine-learning descriptors from materials objects in pymatgen. First, we'll generate some descriptors with matminer's "featurizers" classes. Next, we'll use some of what we learned about dataframes in the previous section to examine our descriptors and prepare them for input to machine learning models.

featurizers overview

Featurizers transform materials primitives into machine-learnable features¶

The general idea of featurizers is that they accept a materials primitive (e.g., pymatgen Composition) and output a vector. For example:

\[\begin{align} f(\mathrm{Fe}_2\mathrm{O}_3) \rightarrow [1.5, 7.8, 9.1, 0.09] \end{align}\]

Matminer contains featurizers for the following pymatgen objects:¶

Composition
Crystal structure
Crystal sites
Bandstructure
Density of states

Depending on the featurizer, the features returned may be:¶

numerical, categorical, or mixed vectors
matrices
other pymatgen objects (for further processing)

Featurizers play nice with dataframes¶

Since most of the time we are working with pandas dataframes, all featurizers work natively with pandas dataframes. We'll provide examples of this later in the lesson.

Featurizers present in matminer¶

Matminer hosts over 60 featurizers, most of which are implemented from methods published in peer reviewed papers. You can find a full list of featurizers on the matminer website. All featurizers have parallelization and convenient error tolerance built into their core methods.

In this lesson, we'll go over the main methods present in all featurizers. By the end of this unit, you will be able to generate descriptors for a wide range of materials informatics problems using one common software interface.

The `featurize` method and basics¶

The core method of any matminer is "featurize". This method accepts a materials object and returns a machine learning vector or matrix. Let's see an example on a pymatgen composition:

from pymatgen.core import Composition

fe2o3 = Composition("Fe2O3")

As a trivial example, we'll get the element fractions with the ElementFraction featurizer.

from matminer.featurizers.composition.element import ElementFraction

ef = ElementFraction()

Now we can featurize our composition.

element_fractions = ef.featurize(fe2o3)

print(element_fractions)

[0, 0, 0, 0, 0, 0, 0, 0.6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

We've managed to generate features for learning, but what do they mean? One way to check is by reading the Features section in the documentation of any featurizer... but a much easier way is to use the feature_labels() method.

element_fraction_labels = ef.feature_labels()
print(element_fraction_labels)

['H', 'He', 'Li', 'Be', 'B', 'C', 'N', 'O', 'F', 'Ne', 'Na', 'Mg', 'Al', 'Si', 'P', 'S', 'Cl', 'Ar', 'K', 'Ca', 'Sc', 'Ti', 'V', 'Cr', 'Mn', 'Fe', 'Co', 'Ni', 'Cu', 'Zn', 'Ga', 'Ge', 'As', 'Se', 'Br', 'Kr', 'Rb', 'Sr', 'Y', 'Zr', 'Nb', 'Mo', 'Tc', 'Ru', 'Rh', 'Pd', 'Ag', 'Cd', 'In', 'Sn', 'Sb', 'Te', 'I', 'Xe', 'Cs', 'Ba', 'La', 'Ce', 'Pr', 'Nd', 'Pm', 'Sm', 'Eu', 'Gd', 'Tb', 'Dy', 'Ho', 'Er', 'Tm', 'Yb', 'Lu', 'Hf', 'Ta', 'W', 'Re', 'Os', 'Ir', 'Pt', 'Au', 'Hg', 'Tl', 'Pb', 'Bi', 'Po', 'At', 'Rn', 'Fr', 'Ra', 'Ac', 'Th', 'Pa', 'U', 'Np', 'Pu', 'Am', 'Cm', 'Bk', 'Cf', 'Es', 'Fm', 'Md', 'No', 'Lr']

We now see the labels in the order that we generated the features.

print(element_fraction_labels[7], element_fractions[7])
print(element_fraction_labels[25], element_fractions[25])

O 0.6
Fe 0.4

Featurizing dataframes¶

We just generated some descriptors and their labels from an individual sample but most of the time our data is in pandas dataframes. Fortunately, matminer featurizers implement a featurize_dataframe() method which interacts natively with dataframes.

Let's grab a new dataset from matminer and use our ElementFraction featurizer on it.

First, we download a dataset as we did in the previous unit. In this example, we'll download a dataset of super hard materials.

from matminer.datasets.dataset_retrieval import load_dataset

df = load_dataset("brgoch_superhard_training")
df.head()

Fetching brgoch_superhard_training.json.gz from https://ndownloader.figshare.com/files/13858931 to data/brgoch_superhard_training.json.gz

Fetching https://ndownloader.figshare.com/files/13858931 in MB: 2.51904MB [00:00, 24.28MB/s]

	formula	bulk_modulus	shear_modulus	composition	material_id	structure	brgoch_feats	suspect_value
0	AlPt3	225.230461	91.197748	(Al, Pt)	mp-188	[[0. 0. 0.] Al, [0. 1.96140395 1.96140...	{'atomic_number_feat_1': 123.5, 'atomic_number...	False
1	Mn2Nb	232.696340	74.590157	(Mn, Nb)	mp-12659	[[-2.23765223e-08 1.42974191e+00 5.92614104e...	{'atomic_number_feat_1': 45.5, 'atomic_number_...	False
2	HfO2	204.573433	98.564374	(Hf, O)	mp-352	[[2.24450185 3.85793022 4.83390736] O, [2.7788...	{'atomic_number_feat_1': 44.0, 'atomic_number_...	False
3	Cu3Pt	159.312640	51.778816	(Cu, Pt)	mp-12086	[[0. 1.86144248 1.86144248] Cu, [1.861...	{'atomic_number_feat_1': 82.5, 'atomic_number_...	False
4	Mg3Pt	69.637565	27.588765	(Mg, Pt)	mp-18707	[[0. 0. 2.73626461] Mg, [0. ...	{'atomic_number_feat_1': 57.0, 'atomic_number_...	False

Next, we can use the featurize_dataframe() method (implemented by all featurizers) to apply ElementFraction to all of our data at once. The only required arguments are the dataframe as input and the input column name (in this case it is composition). featurize_dataframe() is parallelized by default using multiprocessing.

If we look at the dataframe, now we can see our new feature columns.

df = ef.featurize_dataframe(df, "composition")

df.head()

	formula	bulk_modulus	shear_modulus	composition	material_id	structure	brgoch_feats	suspect_value	...
0	AlPt3	225.230461	91.197748	(Al, Pt)	mp-188	[[0. 0. 0.] Al, [0. 1.96140395 1.96140...	{'atomic_number_feat_1': 123.5, 'atomic_number...	False	...
1	Mn2Nb	232.696340	74.590157	(Mn, Nb)	mp-12659	[[-2.23765223e-08 1.42974191e+00 5.92614104e...	{'atomic_number_feat_1': 45.5, 'atomic_number_...	False	...
2	HfO2	204.573433	98.564374	(Hf, O)	mp-352	[[2.24450185 3.85793022 4.83390736] O, [2.7788...	{'atomic_number_feat_1': 44.0, 'atomic_number_...	False	...
3	Cu3Pt	159.312640	51.778816	(Cu, Pt)	mp-12086	[[0. 1.86144248 1.86144248] Cu, [1.861...	{'atomic_number_feat_1': 82.5, 'atomic_number_...	False	...
4	Mg3Pt	69.637565	27.588765	(Mg, Pt)	mp-18707	[[0. 0. 2.73626461] Mg, [0. ...	{'atomic_number_feat_1': 57.0, 'atomic_number_...	False	...

5 rows × 111 columns

Structure Featurizers¶

We can use the same syntax for other kinds of featurizers. Let's now assign descriptors to a structure. We do this with the same syntax as the composition featurizers. First, let's load a dataset containing structures.

df = load_dataset("phonon_dielectric_mp")

df.head()

Fetching phonon_dielectric_mp.json.gz from https://ndownloader.figshare.com/files/13297571 to data/phonon_dielectric_mp.json.gz

Fetching https://ndownloader.figshare.com/files/13297571 in MB: 0.5038079999999999MB [00:00, 27.16MB/s]

	mpid	eps_electronic	eps_total	last phdos peak	structure	formula
0	mp-1000	6.311555	12.773454	98.585771	[[2.8943817 2.04663693 5.01321616] Te, [0. 0....	BaTe
1	mp-1002124	24.137743	32.965593	677.585725	[[0. 0. 0.] Hf, [-3.78195772 -3.78195772 -3.78...	HfC
2	mp-1002164	8.111021	11.169464	761.585719	[[0. 0. 0.] Ge, [ 3.45311592 3.45311592 -3.45...	GeC
3	mp-10044	10.032168	10.128936	701.585723	[[0.98372595 0.69559929 1.70386332] B, [0. 0. ...	BAs
4	mp-1008223	3.979201	6.394043	204.585763	[[0. 0. 0.] Ca, [ 4.95 4.95 -4.95] Se]	CaSe

Let's calculate some basic density features of these structures using DensityFeatures.

from matminer.featurizers.structure import DensityFeatures

densityf = DensityFeatures()
densityf.feature_labels()

['density', 'vpa', 'packing fraction']

These are the features we will get. Now we use featurize_dataframe() to generate these features for all the samples in the dataframe. Since we are using the structures as input to the featurizer, we select the "structure" column.

Let's examine the dataframe and see the structural features.

df = densityf.featurize_dataframe(df, "structure")

df.head()

	mpid	eps_electronic	eps_total	last phdos peak	structure	formula	density	vpa	packing fraction
0	mp-1000	6.311555	12.773454	98.585771	[[2.8943817 2.04663693 5.01321616] Te, [0. 0....	BaTe	4.937886	44.545547	0.596286
1	mp-1002124	24.137743	32.965593	677.585725	[[0. 0. 0.] Hf, [-3.78195772 -3.78195772 -3.78...	HfC	9.868234	16.027886	0.531426
2	mp-1002164	8.111021	11.169464	761.585719	[[0. 0. 0.] Ge, [ 3.45311592 3.45311592 -3.45...	GeC	5.760895	12.199996	0.394180
3	mp-10044	10.032168	10.128936	701.585723	[[0.98372595 0.69559929 1.70386332] B, [0. 0. ...	BAs	5.087634	13.991016	0.319600
4	mp-1008223	3.979201	6.394043	204.585763	[[0. 0. 0.] Ca, [ 4.95 4.95 -4.95] Se]	CaSe	2.750191	35.937000	0.428523

Conversion Featurizers¶

In addition to Bandstructure/DOS/Structure/Composition featurizers, matminer also provides a featurizer interface for converting between pymatgen objects (e.g., assinging oxidation states to compositions) in a fault-tolerant fashion. These featurizers are found in matminer.featurizers.conversion and work with the same featurize/featurize_dataframe etc. syntax as the other featurizers.

The dataset we loaded previously only contains a formula column with string objects. To convert this data into a composition column containing pymatgen Composition objects, we can use the StrToComposition conversion featurizer on the formula column.

from matminer.featurizers.conversions import StrToComposition

stc = StrToComposition()
df = stc.featurize_dataframe(df, "formula", pbar=False)

We can see a new composition column has been added to the dataframe.

df.head()

	mpid	eps_electronic	eps_total	last phdos peak	structure	formula	density	vpa	packing fraction	composition
0	mp-1000	6.311555	12.773454	98.585771	[[2.8943817 2.04663693 5.01321616] Te, [0. 0....	BaTe	4.937886	44.545547	0.596286	(Ba, Te)
1	mp-1002124	24.137743	32.965593	677.585725	[[0. 0. 0.] Hf, [-3.78195772 -3.78195772 -3.78...	HfC	9.868234	16.027886	0.531426	(Hf, C)
2	mp-1002164	8.111021	11.169464	761.585719	[[0. 0. 0.] Ge, [ 3.45311592 3.45311592 -3.45...	GeC	5.760895	12.199996	0.394180	(Ge, C)
3	mp-10044	10.032168	10.128936	701.585723	[[0.98372595 0.69559929 1.70386332] B, [0. 0. ...	BAs	5.087634	13.991016	0.319600	(B, As)
4	mp-1008223	3.979201	6.394043	204.585763	[[0. 0. 0.] Ca, [ 4.95 4.95 -4.95] Se]	CaSe	2.750191	35.937000	0.428523	(Ca, Se)

Advanced capabilities¶

There are powerful functionalities of Featurizers which are worth quickly mentioning before we go practice (and many more not mentioned here).

Dealing with Errors¶

Often, data is messy and certain featurizers will encounter errors. Set ignore_errors=True in featurize_dataframe() to skip errors; if you'd like to see the errors returned in an additional column, also set return_errors=True.

Citing the authors¶

Many featurizers are implemented using methods found in peer reviewed studies. Please cite these original works using the citations() method, which returns the BibTex-formatted references in a Python list. For example:

from matminer.featurizers.structure.bonding import GlobalInstabilityIndex


GlobalInstabilityIndex().citations()

/usr/share/miniconda/envs/workshop/lib/python3.9/site-packages/pandas/io/parsers/readers.py:586: ParserWarning: Length of header or names does not match length of data. This leads to a loss of data with index_col=False.
  return _read(filepath_or_buffer, kwds)

['@article{PhysRevB.87.184115,title = {Structural characterization of R2BaCuO5 (R = Y, Lu, Yb, Tm, Er, Ho, Dy, Gd, Eu and Sm) oxides by X-ray and neutron diffraction},author = {Salinas-Sanchez, A. and Garcia-Muñoz, J.L. and Rodriguez-Carvajal, J. and Saez-Puche, R. and Martinez, J.L.},journal = {Journal of Solid State Chemistry},volume = {100},issue = {2},pages = {201-211},year = {1992},doi = {10.1016/0022-4596(92)90094-C},url = {https://doi.org/10.1016/0022-4596(92)90094-C}}']

Part 3: Machine learning models¶

In parts 1 and 2, we demonstrated how to download a dataset and add machine learnable features. In part 3, we show how to train a machine learning model on a dataset and analyze the results.

Scikit-Learn¶

This unit makes extensive use of the scikit-learn package, an open-source python package for machine learning. Matminer has been designed to make machine learning with scikit-learn as easy as possible. Other machine learning packages exist, such as TensorFlow, which implement neural network architectures. These packages can also be used with matminer but are outside the scope of this workshop.

Load and prepare a pre-featurized model¶

First, let's load a dataset that we can use for machine learning. In advance, we've added some composition and structure features to the elastic_tensor_2015 dataset used in exercises 1 and 2.

import os
from matminer.utils.io import load_dataframe_from_json

df = load_dataframe_from_json(os.path.join("resources", "elastic_tensor_2015_featurized.json"))
df.head()

Reading file resources/elastic_tensor_2015_featurized.json: 2362it [00:01, 1717.57it/s]
Decoding objects from resources/elastic_tensor_2015_featurized.json: 100%|##########| 2362/2362 [00:01<00:00, 1718.68it/s]

	structure	formula	K_VRH	composition	MagpieData minimum Number	MagpieData maximum Number	MagpieData range Number	MagpieData mean Number	MagpieData avg_dev Number	MagpieData mode Number	...	MagpieData minimum SpaceGroupNumber	MagpieData maximum SpaceGroupNumber	MagpieData range SpaceGroupNumber	MagpieData mean SpaceGroupNumber	MagpieData avg_dev SpaceGroupNumber	MagpieData mode SpaceGroupNumber	density	vpa	packing fraction
0	[[0.94814328 2.07280467 2.5112 ] Nb, [5.273...	Nb4CoSi	194.268884	(Nb, Co, Si)	14.0	41.0	27.0	34.166667	9.111111	41.0	...	194.0	229.0	35.0	222.833333	9.611111	229.0	7.834556	16.201654	0.688834
1	[[0. 0. 0.] Al, [1.96639263 1.13529553 0.75278...	Al(CoSi)2	175.449907	(Al, Co, Si)	13.0	27.0	14.0	19.000000	6.400000	14.0	...	194.0	227.0	33.0	213.400000	15.520000	194.0	5.384968	12.397466	0.644386
2	[[1.480346 1.480346 1.480346] Si, [0. 0. 0.] Os]	SiOs	295.077545	(Si, Os)	14.0	76.0	62.0	45.000000	31.000000	14.0	...	194.0	227.0	33.0	210.500000	16.500000	194.0	13.968635	12.976265	0.569426
3	[[0. 1.09045794 0.84078375] Ga, [0. ...	Ga	49.130670	(Ga)	31.0	31.0	0.0	31.000000	0.000000	31.0	...	64.0	64.0	0.0	64.000000	0.000000	64.0	6.036267	19.180359	0.479802
4	[[1.0094265 4.24771709 2.9955487 ] Si, [3.028...	SiRu2	256.768081	(Si, Ru)	14.0	44.0	30.0	34.000000	13.333333	44.0	...	194.0	227.0	33.0	205.000000	14.666667	194.0	9.539514	13.358418	0.598395

5 rows × 139 columns

We first need to split the dataset into the "target" property, and the "features" used for learning. In this model, we will be using the bulk modulus (K_VRH) as the target property. We use the values attribute of the dataframe to give the target properties a numpy array, rather than pandas Series object.

y = df['K_VRH'].values

print(y)

[194.26888436 175.44990675 295.07754499 ...  89.41816126  99.3845653
  35.93865993]

The machine learning algorithm can only use numerical features for training. Accordingly, we need to remove any non-numerical columns from our dataset. Additionally, we want to remove the K_VRH column from the set of features, as the model should not know about the target property in advance.

The dataset loaded above, includes structure, formula, and composition columns that were previously used to generate the machine learnable features. Let's remove them using the pandas drop() function, discussed in unit 1. Remember, axis=1 indicates we are dropping columns rather than rows.

X = df.drop(["structure", "formula", "composition", "K_VRH"], axis=1)

We can see all the descriptors in model using the column attribute.

print("There are {} possible descriptors:".format(len(X.columns)))
print(X.columns)

There are 135 possible descriptors:
Index(['MagpieData minimum Number', 'MagpieData maximum Number',
       'MagpieData range Number', 'MagpieData mean Number',
       'MagpieData avg_dev Number', 'MagpieData mode Number',
       'MagpieData minimum MendeleevNumber',
       'MagpieData maximum MendeleevNumber',
       'MagpieData range MendeleevNumber', 'MagpieData mean MendeleevNumber',
       ...
       'MagpieData mode GSmagmom', 'MagpieData minimum SpaceGroupNumber',
       'MagpieData maximum SpaceGroupNumber',
       'MagpieData range SpaceGroupNumber', 'MagpieData mean SpaceGroupNumber',
       'MagpieData avg_dev SpaceGroupNumber',
       'MagpieData mode SpaceGroupNumber', 'density', 'vpa',
       'packing fraction'],
      dtype='object', length=135)

Try a random forest model using scikit-learn¶

The scikit-learn library makes it easy to use our generated features for training machine learning models. It implements a variety of different regression models and contains tools for cross-validation.

In the interests of time, in this example we will only trial a single model but it is good practice to trial multiple models to see which performs best for your machine learning problem. A good "starting" model is the random forest model. Let's create a random forest model.

from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=100, random_state=1)

Notice we created the model with the number of estimators (n_estimators) set to 100. n_estimators is an example of a machine learning hyper-parameter. Most models contain many tunable hyper-parameters. To obtain good performance, it is necessary to fine tune these parameters for each individual machine learning problem. There is currently no simple way to know in advance what hyper-parameters will be optimal. Usually, a trial and error approach is used.

We can now train our model to use the input features (X) to predict the target property (y). This is achieved using the fit() function.

rf.fit(X, y)

RandomForestRegressor(random_state=1)

That's it, we have trained our first machine learning model!

Evaluating model performance¶

Next, we need to assess how the model is performing. To do this, we first ask the model to predict the bulk modulus for every entry in our original dataframe.

y_pred = rf.predict(X)

Next, we can check the accuracy of our model by looking at the root mean squared error of our predictions. Scikit-learn provides a mean_squared_error() function to calculate the mean squared error. We then take the square-root of this to obtain our final performance metric.

import numpy as np
from sklearn.metrics import mean_squared_error

mse = mean_squared_error(y, y_pred)
print('training RMSE = {:.3f} GPa'.format(np.sqrt(mse)))

training RMSE = 7.272 GPa

An RMSE of 7.2 GPa looks very reasonable! However, as the model was trained and evaluated on exactly the same data, this is not a true estimate of how the model will perform for unseen materials (the primary purpose of machine learning studies).

Cross validation¶

To obtain a more accurate estimate of prediction performance and validate that we are not over-fitting, we need to check the cross-validation score rather than the fitting score.

In cross-validation, the data is partitioned randomly into \(n\) "splits" (in this case 10), each containing roughly the same number of samples. The model is trained on \(n-1\) splits (the training set) and the model performance evaluated by comparing the actual and predicted values for the final split (the testing set). In total, this process is repeated \(n\) times, such that each split is at some point used as the testing set. The cross-validation score is the average score across all testing sets.

There are a number of ways to partition the data into splits. In this example, we use the KFold method and select the number of splits to be 10. I.e., 90 % of the data will be used as the training set, with 10 % used as the testing set.

from sklearn.model_selection import KFold

kfold = KFold(n_splits=10, random_state=1, shuffle=True)

Note, we set random_state=1 to ensure every attendee gets the same answer for their model.

Finally, obtaining the cross validation score can be automated using the Scikit-Learn cross_val_score() function. This function requires a machine learning model, the input features, and target property as arguments. Note, we pass the kfold object as thecv argument, to make cross_val_score() use the correct test/train splits.

For each split, the model will be trained from scratch, before the performance is evaluated. As we have to train and predict 10 times, cross validation can often take some time to perform. In our case, the model is quite small, so the process only takes about a minute. The final cross validation score is the average across all splits.

from sklearn.model_selection import cross_val_score

scores = cross_val_score(rf, X, y, scoring='neg_mean_squared_error', cv=kfold)

rmse_scores = [np.sqrt(abs(s)) for s in scores]
print('Mean RMSE: {:.3f}'.format(np.mean(rmse_scores)))

Mean RMSE: 18.903

Notice that our RMSE has almost tripled as now it reflects the true predictive power of the model. However, a root-mean-squared error of ~19 GPa is still not bad!

Visualizing model performance¶

We can visualize the predictive performance of our model by plotting the our predictions against the actual value, for each sample in the test set for all test/train splits. First, we get the predicted values of the testing set for each split using the cross_val_predict method. This is similar to the cross_val_score method, except it returns the actual predictions, rather than the model score.

from sklearn.model_selection import cross_val_predict

y_pred = cross_val_predict(rf, X, y, cv=kfold)

Let's now add our predicted values to our dataframe and calculate an absolute percentage error for each sample.

We can do this conveniently for all of our samples with the dataframe columns.

If we scroll to the end of the dataframe, our predicted K_VRH and percentage errors are given for each sample. This might allow us to examine manually which samples are performing well and which are performing poorly.

df["K_VRH predicted"] = y_pred
df["percentage_error"] = (df["K_VRH"] - df["K_VRH predicted"]).abs()/df["K_VRH"] * 100

df

	structure	formula	K_VRH	composition	MagpieData minimum Number	MagpieData maximum Number	MagpieData range Number	MagpieData mean Number	MagpieData avg_dev Number	MagpieData mode Number	...	MagpieData maximum SpaceGroupNumber	MagpieData range SpaceGroupNumber	MagpieData mean SpaceGroupNumber	MagpieData avg_dev SpaceGroupNumber	MagpieData mode SpaceGroupNumber	density	vpa	packing fraction	K_VRH predicted	percentage_error
0	[[0.94814328 2.07280467 2.5112 ] Nb, [5.273...	Nb4CoSi	194.268884	(Nb, Co, Si)	14.0	41.0	27.0	34.166667	9.111111	41.0	...	229.0	35.0	222.833333	9.611111	229.0	7.834556	16.201654	0.688834	199.911721	2.904653
1	[[0. 0. 0.] Al, [1.96639263 1.13529553 0.75278...	Al(CoSi)2	175.449907	(Al, Co, Si)	13.0	27.0	14.0	19.000000	6.400000	14.0	...	227.0	33.0	213.400000	15.520000	194.0	5.384968	12.397466	0.644386	166.865227	4.892952
2	[[1.480346 1.480346 1.480346] Si, [0. 0. 0.] Os]	SiOs	295.077545	(Si, Os)	14.0	76.0	62.0	45.000000	31.000000	14.0	...	227.0	33.0	210.500000	16.500000	194.0	13.968635	12.976265	0.569426	254.852161	13.632140
3	[[0. 1.09045794 0.84078375] Ga, [0. ...	Ga	49.130670	(Ga)	31.0	31.0	0.0	31.000000	0.000000	31.0	...	64.0	0.0	64.000000	0.000000	64.0	6.036267	19.180359	0.479802	50.252618	2.283600
4	[[1.0094265 4.24771709 2.9955487 ] Si, [3.028...	SiRu2	256.768081	(Si, Ru)	14.0	44.0	30.0	34.000000	13.333333	44.0	...	227.0	33.0	205.000000	14.666667	194.0	9.539514	13.358418	0.598395	227.636131	11.345627
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1176	[[1.55171489 0.89588144 8.41159136] Ti, [-1.88...	Ti2CdC	111.788114	(Ti, Cd, C)	6.0	48.0	42.0	24.500000	11.750000	22.0	...	194.0	0.0	194.000000	0.000000	194.0	6.016734	15.190017	0.658712	155.019117	38.672271
1177	[[0. 8.53417539 0.91740975] Sc, [0. ...	ScSi	101.326807	(Sc, Si)	14.0	21.0	7.0	17.500000	3.500000	14.0	...	227.0	33.0	210.500000	16.500000	194.0	3.336856	18.174013	0.625414	111.135312	9.680069
1178	[[0. 9.084549 0.960921] Y, [0. 1.4...	YSi	89.418161	(Y, Si)	14.0	39.0	25.0	26.500000	12.500000	14.0	...	227.0	33.0	210.500000	16.500000	194.0	4.462773	21.765469	0.689264	94.369733	5.537546
1179	[[5.11035838 2.07486738 0. ] Al, [3.996...	Al2Cu	99.384565	(Al, Cu)	13.0	29.0	16.0	18.333333	7.111111	13.0	...	225.0	0.0	225.000000	0.000000	225.0	4.356814	14.928982	0.595452	98.320922	1.070230
1180	[[0. 0. 0.] V, [0. 2.824189 0. ] Cu...	VCu3Se4	35.938660	(V, Cu, Se)	23.0	34.0	11.0	30.750000	3.250000	34.0	...	229.0	215.0	120.000000	106.000000	14.0	5.136414	22.525854	0.370166	65.607360	82.553718

1181 rows × 141 columns

A more convient way of examining our model's performance is by creating a graph comparing our cross-validation predicted bulk modulus to the actual bulk modulus for every sample. Here, we use plotly.express from the Plotly package to create our graphs.

Plotly Express is designed to create many kinds of plots directly from dataframes. Since we already have our data inside a dataframe, we can specify the column names to tell Plotly the data we'd like to show.

We make two series of data: - First, a reference line indicating "perfect" peformance of the model. - Second, a scatter plot of the predicted K_VRH vs the actual K_VRH for every sample.

import plotly.express as px
import plotly.graph_objects as go

reference_line = go.Scatter(
    x=[0, 400],
    y=[0, 400],
    line=dict(color="black", dash="dash"),
    mode="lines",
    showlegend=False
)

fig = px.scatter(
    df, 
    x="K_VRH", 
    y="K_VRH predicted", 
    hover_name="formula", 
    color="percentage_error", 
    color_continuous_scale=px.colors.sequential.Bluered,
)

fig.add_trace(reference_line)
fig.show()

Not too bad! However, there are definitely some outliers (you can hover over the points with your mouse to see what they are).

Model interpretation¶

An important aspect of machine learning is being able to understand why a model is making certain predictions. Random forest models are particularly amenable to interpretation as they possess a feature_importances attribute, which contains the importance of each feature in deciding the final prediction. Let's look at the feature importances of our model.

rf.feature_importances_

array([2.77737706e-04, 6.86497802e-04, 4.19014378e-04, 9.38306579e-04,
       6.27788172e-04, 8.11685429e-04, 5.85205797e-03, 4.13582985e-04,
       5.68896006e-03, 3.99784395e-03, 2.29068933e-03, 4.00079790e-03,
       2.80170565e-04, 1.12108238e-03, 5.04161260e-04, 7.21240737e-04,
       6.66258777e-04, 2.65253041e-04, 6.31863354e-02, 2.61334748e-02,
       1.69886830e-03, 5.43940586e-01, 3.79683746e-03, 1.84024489e-03,
       2.00724094e-02, 2.74160018e-04, 2.89023628e-04, 1.65750614e-03,
       1.70010289e-03, 8.44812934e-03, 4.07321100e-05, 4.20522484e-05,
       8.69871167e-05, 1.12538000e-03, 1.05324451e-03, 4.16679765e-05,
       3.96170134e-04, 1.48598578e-03, 6.16768132e-04, 3.24518244e-03,
       8.40211726e-04, 6.08207520e-04, 2.48464536e-03, 5.38387869e-04,
       9.44676379e-04, 1.17935220e-02, 1.54290424e-03, 3.42464373e-03,
       1.62873316e-04, 4.59487184e-05, 1.77379083e-04, 1.35098601e-03,
       3.88133058e-04, 3.33992917e-04, 2.17490708e-04, 3.12382175e-05,
       1.01795637e-04, 1.03213685e-03, 6.85435551e-04, 6.34344688e-05,
       2.02095869e-04, 2.66249318e-03, 3.74273103e-04, 7.21229249e-04,
       1.29571361e-03, 3.56619677e-04, 3.67797732e-06, 1.81902930e-05,
       1.66200350e-05, 2.20302332e-04, 1.52030919e-04, 1.48360193e-05,
       1.71809628e-03, 1.19092123e-03, 3.73628049e-04, 1.42148596e-03,
       6.57524332e-04, 6.28622233e-03, 1.16750735e-05, 4.36732403e-05,
       3.66858164e-05, 1.34265640e-03, 1.90009692e-04, 1.65555892e-04,
       5.84467323e-05, 2.62242035e-04, 3.16564734e-04, 4.75664484e-04,
       9.69183532e-04, 2.32465824e-05, 3.02435786e-04, 3.16506500e-04,
       1.37421635e-03, 1.55517264e-03, 4.35381175e-03, 1.13587974e-03,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 1.49055670e-03, 1.23382329e-03,
       1.39319470e-03, 5.25514449e-03, 1.45239205e-03, 2.80369169e-03,
       6.08487913e-04, 1.41780570e-02, 1.86595558e-03, 2.05837632e-02,
       2.18064410e-03, 3.69109295e-03, 8.52970476e-05, 4.96849240e-05,
       1.33715940e-04, 9.60365387e-04, 5.01421614e-04, 2.88261104e-04,
       1.88675594e-05, 9.08252204e-05, 1.04200208e-04, 4.82154177e-04,
       2.80548603e-04, 6.50107456e-05, 2.28368912e-04, 5.33462112e-04,
       3.00309809e-04, 1.14474431e-03, 1.02300569e-03, 3.41368312e-04,
       2.72149182e-02, 1.34503923e-01, 6.79013491e-03])

To make sense of this, we need to know which feature each number corresponds to. We can use PlotlyFig to plot the importances of the 5 most important features.

importances = rf.feature_importances_
included = X.columns.values
indices = np.argsort(importances)[::-1]

fig_bar = px.bar(
    x=included[indices][0:5], 
    y=importances[indices][0:5], 
    title="Feature Importances of Random Forest",
    labels={"x": "Feature", "y": "Importance"}
)
fig_bar.show()

Bonus: Curated ML datasets with Matbench¶

If you are interested in comparing your machine learning algorithms with the state of the art, matminer also offers access to a curated set of 13 benchmarking datasets called Matbench, which have been used to benchmark SoTA algorithms like RooSt, CGCNN, CRABNet, MEGNet, Automatminer, and more.

matbench

The Matbench datasets span a wide variety of materials informatics tasks such as:

Predicting materials properties given only composition, or given composition and structure
Predicting a wide array of target properties, such as elastic constants, dielectric constants, formation energies, and steel yield strength
Data-sparse tasks (300 samples) and (relatively) data-rich tasks (100k+ samples)
Both regression and classification tasks

The full set of datasets is given in the table below:

Task name	Task type	Target column (unit)	Input type	Samples	MAD (regression) or Fraction True (classification)	Links
`matbench_steels`	regression	`yield strength` (MPa)	composition	312	229.3743	download, interactive
`matbench_jdft2d`	regression	`exfoliation_en` (meV/atom)	structure	636	67.2020	download, interactive
`matbench_phonons`	regression	`last phdos peak` (cm^-1)	structure	1,265	323.7870	download, interactive
`matbench_expt_gap`	regression	`gap expt` (eV)	composition	4,604	1.1432	download, interactive
`matbench_dielectric`	regression	`n` (unitless)	structure	4,764	0.8085	download, interactive
`matbench_expt_is_metal`	classification	`is_metal`	composition	4,921	0.4981	download, interactive
`matbench_glass`	classification	`gfa`	composition	5,680	0.7104	download, interactive
`matbench_log_gvrh`	regression	`log10(G_VRH)` (log10(GPa))	structure	10,987	0.2931	download, interactive
`matbench_log_kvrh`	regression	`log10(K_VRH)` (log10(GPa))	structure	10,987	0.2897	download, interactive
`matbench_perovskites`	regression	`e_form` (eV/unit cell)	structure	18,928	0.5660	download, interactive
`matbench_mp_gap`	regression	`gap pbe` (eV)	structure	106,113	1.3271	download, interactive
`matbench_mp_is_metal`	classification	`is_metal`	structure	106,113	0.4349	download, interactive
`matbench_mp_e_form`	regression	`e_form` (eV/atom)	structure	132,752	1.0059	download, interactive

You can click on the "download" links above to download the raw files, or alternatively click on "interactive" to explore the datasets on MPContribs (note: you must be logged in to MPContribs to see the data).

matbench

Loading Matbench datasets with Matminer¶

We can easily load and examine any of these datasets using the exact same load_dataset function we have used throughout this lesson in matminer. Let's look at the smallest of the matbench datasets, matbench_steels, which has 312 samples and contains chemical compositions and yield strengths in MPa as the target property.

from matminer.datasets import load_dataset

df = load_dataset("matbench_steels")

Fetching matbench_steels.json.gz from https://ml.materialsproject.org/projects/matbench_steels.json.gz to data/matbench_steels.json.gz

Fetching https://ml.materialsproject.org/projects/matbench_steels.json.gz in MB: 0.010239999999999999MB [00:00,  4.46MB/s]

All Matbench datasets are reduced to the minimum possible format, including a single column for input (either structure or composition) and a single target property (a number for regression or a bool for binary classification).

If we use the get_all_dataset_info we can examine some more about where this data came from, how it was edited into this form, as well as the meaning of columns.

from matminer.datasets import get_all_dataset_info


info = get_all_dataset_info("matbench_steels")

print(info)

Dataset: matbench_steels
Description: Matbench v0.1 test dataset for predicting steel yield strengths from chemical composition alone. Retrieved from Citrine informatics. Deduplicated. For benchmarking w/ nested cross validation, the order of the dataset must be identical to the retrieved data; refer to the Automatminer/Matbench publication for more details.
Columns:
    composition: Chemical formula.
    yield strength: Target variable. Experimentally measured steel yield strengths, in MPa.
Num Entries: 312
Reference: https://citrination.com/datasets/153092/
Bibtex citations: ["@Article{Dunn2020,\nauthor={Dunn, Alexander\nand Wang, Qi\nand Ganose, Alex\nand Dopp, Daniel\nand Jain, Anubhav},\ntitle={Benchmarking materials property prediction methods: the Matbench test set and Automatminer reference algorithm},\njournal={npj Computational Materials},\nyear={2020},\nmonth={Sep},\nday={15},\nvolume={6},\nnumber={1},\npages={138},\nabstract={We present a benchmark test suite and an automated machine learning procedure for evaluating supervised machine learning (ML) models for predicting properties of inorganic bulk materials. The test suite, Matbench, is a set of 13{\\thinspace}ML tasks that range in size from 312 to 132k samples and contain data from 10 density functional theory-derived and experimental sources. Tasks include predicting optical, thermal, electronic, thermodynamic, tensile, and elastic properties given a material's composition and/or crystal structure. The reference algorithm, Automatminer, is a highly-extensible, fully automated ML pipeline for predicting materials properties from materials primitives (such as composition and crystal structure) without user intervention or hyperparameter tuning. We test Automatminer on the Matbench test suite and compare its predictive power with state-of-the-art crystal graph neural networks and a traditional descriptor-based Random Forest model. We find Automatminer achieves the best performance on 8 of 13 tasks in the benchmark. We also show our test suite is capable of exposing predictive advantages of each algorithm---namely, that crystal graph methods appear to outperform traditional machine learning methods given {\\textasciitilde}104 or greater data points. We encourage evaluating materials ML algorithms on the Matbench benchmark and comparing them against the latest version of Automatminer.},\nissn={2057-3960},\ndoi={10.1038/s41524-020-00406-3},\nurl={https://doi.org/10.1038/s41524-020-00406-3}\n}\n", '@misc{Citrine Informatics,\ntitle = {Mechanical properties of some steels},\nhowpublished = {\\url{https://citrination.com/datasets/153092/},\n}']
File type: json.gz
Figshare URL: https://ml.materialsproject.org/projects/matbench_steels.json.gz
SHA256 Hash Digest: 473bc4957b2ea5e6465aef84bc29bb48ac34db27d69ea4ec5f508745c6fae252

The Matbench Leaderboard and Benchmarking Code¶

We host an online benchmark leaderboard - similar to an "ImageNet" for materials science - at the following URL:

https://hackingmaterials.lbl.gov/matbench ¶

Which contains comprehensive data on various SoTA algorithm's performance across tasks in Matbench. On the website you can find:

A general purpose leaderboard comparing only the most-widely applicable algorithms
Individual per-task (per-dataset) leaderboards for comparing any ML model on a particular task
Comprehensive breakdowns of cross-validation performance, statistics, and metadata for every model
Access to individual sample predictions for each and every submission

matbench

General purpose leaderboard¶

Task name	Samples	Algorithm	Verified MAE (unit) or ROCAUC	Notes
matbench_steels	312	AMMExpress v2020	97.4929 (MPa)
matbench_jdft2d	636	AMMExpress v2020	39.8497 (meV/atom)
matbench_phonons	1,265	CrabNet	55.1114 (cm^-1)
matbench_expt_gap	4,604	CrabNet	0.3463 (eV)
matbench_dielectric	4,764	AMMExpress v2020	0.3150 (unitless)
matbench_expt_is_metal	4,921	AMMExpress v2020	0.9209
matbench_glass	5,680	AMMExpress v2020	0.8607
matbench_log_gvrh	10,987	AMMExpress v2020	0.0874 (log10(GPa))
matbench_log_kvrh	10,987	AMMExpress v2020	0.0647 (log10(GPa))
matbench_perovskites	18,928	CGCNN v2019	0.0452 (eV/unit cell)	structure required
matbench_mp_gap	106,113	CrabNet	0.2655 (eV)
matbench_mp_is_metal	106,113	CGCNN v2019	0.9520	structure required
matbench_mp_e_form	132,752	CGCNN v2019	0.0337 (eV/atom)	structure required

Submit your algorithm¶

Although outside the scope of this lesson, you can submit your own algorithms to Matbench via Github and appear on the leaderboards! For more information on how to do this, see the documentation on how to record results and submit to the leaderboard.

Materials data science: descriptors and machine learning¶

Machine learning workflow¶

Part 1: Data retrieval and filtering¶

Manipulating and examining pandas DataFrame objects¶

Inspecting the dataset¶

Indexing the dataset¶

Filtering the dataset¶

Generating new columns¶

Part 2: Generating descriptors for machine learning¶

Featurizers transform materials primitives into machine-learnable features¶

Matminer contains featurizers for the following pymatgen objects:¶

Depending on the featurizer, the features returned may be:¶

Featurizers play nice with dataframes¶

Featurizers present in matminer¶

The featurize method and basics¶

Featurizing dataframes¶

Structure Featurizers¶

Conversion Featurizers¶

Advanced capabilities¶

Dealing with Errors¶

Citing the authors¶

Part 3: Machine learning models¶

Scikit-Learn¶

Load and prepare a pre-featurized model¶

Try a random forest model using scikit-learn¶

Evaluating model performance¶

Cross validation¶

Visualizing model performance¶

Model interpretation¶

Bonus: Curated ML datasets with Matbench¶

Loading Matbench datasets with Matminer¶

The Matbench Leaderboard and Benchmarking Code¶

https://hackingmaterials.lbl.gov/matbench¶

General purpose leaderboard¶

Submit your algorithm¶

Manipulating and examining pandas `DataFrame` objects¶

The `featurize` method and basics¶

https://hackingmaterials.lbl.gov/matbench ¶