Lugano tutorial: Dimensionality reduction

Aims

This tutorial will show you how to you can use PLUMED to perform dimensionality reduction. The tutorial will try not to focus on the application of one particular algorithm but will instead try to show you the principles behind the implementation of these algorithms that has been adopted within PLUMED. By the end of the tutorial you will thus be able to design your own dimensionality reduction algorithm.

Objectives

Once this tutorial is completed students will

  • Be able to use COLLECT_FRAMES to store a trajectory for later analysis
  • Be able to use PCA to perform principal component analysis
  • Be able to construct a dissimilarity matrix using EUCLIDEAN_DISSIMILARITIES
  • Be able to select a subset of landmark points to analyze with particular dimensionality reduction algorithm.
  • Be able to construct low dimensional representations using CLASSICAL_MDS and SKETCH_MAP.
  • Be able to generate projections of non-landmark points by using PROJECT_ALL_ANALYSIS_DATA

Resources

The TARBALL for this project contains the following files:

  • beta-hairpin.pdb : A pdb file containing the protein that we are going to study in this tutorial in a beta hairpin configuration. This input will be used as a template so that we can use the names of special groups in many of the inputs that follow.

In addition, you will also need to get a copy of the trajectory that we will be analyzing in this tutorial by executing the following command:

wget https://github.com/plumed/lugano2019/raw/master/handson_5/traj.dcd

The trajectory we are analyzing is a smaller version of the trajectory that was analyzed in the following paper:

In this paper the trajectory was analyzed with a variety of different dimensionality reduction algorithms and the results were compared. The paper may, therefore, be of interest.

This tutorial has been tested on v2.5 but it should also work with other versions of PLUMED.

Also notice that the .solutions direction of the tarball contains correct input files for the exercises. Please only look at these files once you have tried to solve the problems yourself. Similarly the tutorial below contains questions for you to answer that are shown in bold. You can reveal the answers to these questions by clicking on links within the tutorial but you should obviously try to answer things yourself before reading these answers.

Introduction

In all of the previous tutorials we have used functions that take the position of all the atoms in the system - a \(3N\) dimensional vector, where \(N\) is the number of atoms as input. This function then outputs a single number - the value of the collective variable - that tells us where in a low dimensional space we should project that configuration. Problems can arise because this collective-variable function is many-to-one and it may thus be difficult to distinguish between every different pair of structurally distinct conformers of our system.

In this tutorial we are going to introduce an alternative approach to this business of finding collective variables. In this alternative approach we are going to stop trying to seek out a function that can take any configuration of the atoms (any \(3N\)-dimensional vector) and find its low dimensional projection on the collective variable axis. Instead we are going to take a set of configurations of the atoms (a set of \(3N\)-dimensional vectors of atom positions) and try to find a sensible set of projections for these configurations. We are going to find this low dimensional representation by seeking out an isometry between the space containing the \(3N\)-dimensional vectors of atom positions and some lower-dimensional space. This idea is explained in more detail in the following video and details on the various algorithms that we are using in the tutorial can be found in:

As you will find out if you read the chapter that is linked above there are multiple ways to construct an isometric embedding of a trajectory. This tutorial will thus try to teach you a set of basic ideas and will then encourage you to experiment and to develop your own strategy for representing the data set.

Exercises

Collecting the trajectory

The first thing that we need to learn to do in order to run these dimensionality reduction algorithms is to store the trajectory so that we can analyze in later. The following input (once the blanks are filled in) will take the positions of the non-hydrogen atoms in our protein and store them every 1 step in an object that we can refer to later in the input using the label data. All the configurations stored in data will then be output to a pdb file once the whole trajectory is read in. Fill in the blanks in the input below now:

Click on the labels of the actions for more information on what each action computes
tested on v2.9
# This reads in the template pdb file and thus allows us to use the @nonhydrogens
# special group later in the input
MOLINFO STRUCTUREcompulsory keyword 
a file in pdb format containing a reference structure. =__FILL__ MOLTYPEcompulsory keyword ( default=protein )
what kind of molecule is contained in the pdb file - usually not needed since protein/RNA/DNA
are compatible =protein 
# This stores the positions of all the nonhydrogen atoms for later analysis
cc: COLLECT_FRAMES __FILL__ could not find this keyword =@nonhydrogens 
# This should output the atomic positions for the frames that were collected to a pdb file called traj.pdb
OUTPUT_ANALYSIS_DATA_TO_PDB USE_OUTPUT_DATA_FROMuse the output of the analysis performed by this object as input to your new analysis
object =__FILL__ FILEcompulsory keyword 
the name of the file to output to =traj.pdb 

Then, once all the blanks are filled in, run the command using:

plumed driver --mf_dcd traj.dcd

Notice that the above input stored the atomic positions of the atoms. We can use the atomic positions in many of the dimensionality reductions that will be discussed later in this tutorial or we can use a high-dimensional vector of collective variables. The following input thus gives an example of which shows you can compute and store the values the Ramachandran angles of the protein took in all the trajectory frames so that they can be analyzed using a dimensionality reduction algorithm. Try to fill in the blanks on this input and then run this form of analysis on the trajectory using the command above once more:

Click on the labels of the actions for more information on what each action computes
tested on v2.9
# This reads in the template pdb file and thus allows us to use the @nonhydrogens
# special group later in the input
MOLINFO STRUCTUREcompulsory keyword 
a file in pdb format containing a reference structure. =__FILL__ MOLTYPEcompulsory keyword ( default=protein )
what kind of molecule is contained in the pdb file - usually not needed since protein/RNA/DNA
are compatible =protein 
# The following commands compute all the Ramachandran angles of the protein for you
r2-phi: TORSION ATOMSthe four atoms involved in the torsional angle =@phi-2 
r2-psi: TORSION ATOMSthe four atoms involved in the torsional angle =@psi-2 
r3-phi: TORSION ATOMSthe four atoms involved in the torsional angle =@phi-3 
r3-psi: TORSION ATOMSthe four atoms involved in the torsional angle =@psi-3 
r4-phi: TORSION __FILL__
r4-psi: TORSION __FILL__
r5-phi: TORSION __FILL__
r5-psi: TORSION __FILL__
r6-phi: TORSION __FILL__
r6-psi: TORSION __FILL__
r7-phi: TORSION __FILL__
r7-psi: TORSION __FILL__
r8-phi: TORSION __FILL__
r8-psi: TORSION __FILL__
r9-phi: TORSION __FILL__
r9-psi: TORSION __FILL__
r10-phi: TORSION __FILL__
r10-psi: TORSION __FILL__
r11-phi: TORSION __FILL__
r11-psi: TORSION __FILL__
r12-phi: TORSION __FILL__
r12-psi: TORSION __FILL__
r13-phi: TORSION __FILL__
r13-psi: TORSION __FILL__
r14-phi: TORSION __FILL__
r14-psi: TORSION __FILL__
r15-phi: TORSION __FILL__
r15-psi: TORSION __FILL__
r16-phi: TORSION __FILL__
r16-psi: TORSION __FILL__
# This command stores all the Ramachandran angles that were computed
cc: COLLECT_FRAMES __FILL__ could not find this keyword =r2-phi,r2-psi,r3-phi,r3-psi,r4-phi,r4-psi,r5-phi,r5-psi,r6-phi,r6-psi,r7-phi,r7-psi,r8-phi,r8-psi,r9-phi,r9-psi,r10-phi,r10-psi,r11-phi,r11-psi,r12-phi,r12-psi,r13-phi,r13-psi,r14-phi,r14-psi,r15-phi,r15-psi,r16-phi,r16-psi 
# This command outputs all the Ramachandran angles that were stored to a file called angles_data
OUTPUT_ANALYSIS_DATA_TO_COLVAR USE_OUTPUT_DATA_FROMuse the output of the analysis performed by this object as input to your new analysis
object =__FILL__ ARGthe input for this action is the scalar output from one or more other actions. =cc.* FILEcompulsory keyword 
the name of the file to output to =angles_data 

Performing PCA

Having learned how to store data for later analysis with a dimensionality reduction algorithm lets now apply principal component analysis (PCA) upon our stored data. In principal component analysis a low dimensional projections for our trajectory are constructed by:

  • Computing a covariance matrix from the trajectory data
  • Diagonalizing the covariance matrix.
  • Calculating the projection of each trajectory frame on a subset of the eigenvectors of the covariance matrix.

To perform PCA using PLUMED we are going to use the following input with the blanks filled in:

Click on the labels of the actions for more information on what each action computes
tested on v2.9
# This reads in the template pdb file and thus allows us to use the @nonhydrogens
# special group later in the input
MOLINFO STRUCTUREcompulsory keyword 
a file in pdb format containing a reference structure. =__FILL__ MOLTYPEcompulsory keyword ( default=protein )
what kind of molecule is contained in the pdb file - usually not needed since protein/RNA/DNA
are compatible =protein 

# This stores the positions of all the nonhydrogen atoms for later analysis
cc: COLLECT_FRAMES __FILL__ could not find this keyword =@nonhydrogens 
# This diagonalizes the covariance matrix
pca: PCA USE_OUTPUT_DATA_FROMuse the output of the analysis performed by this object as input to your new analysis
object =__FILL__ METRICcompulsory keyword ( default=EUCLIDEAN )
the method that you are going to use to measure the distances between points =OPTIMAL NLOW_DIMcompulsory keyword 
number of low-dimensional coordinates required =2 
# This projects each of the trajectory frames onto the low dimensional space that was
# identified by the PCA command
dat: PROJECT_ALL_ANALYSIS_DATA USE_OUTPUT_DATA_FROMuse the output of the analysis performed by this object as input to your new analysis
object =__FILL__ PROJECTIONcompulsory keyword 
the projection that you wish to generate out-of-sample projections with =__FILL__ 
# This should output the atomic positions for the frames that were collected and analyzed using PCA
OUTPUT_ANALYSIS_DATA_TO_PDB USE_OUTPUT_DATA_FROMuse the output of the analysis performed by this object as input to your new analysis
object =__FILL__ FILEcompulsory keyword 
the name of the file to output to =traj.pdb 
# This should output the PCA projections of all the coordinates
OUTPUT_ANALYSIS_DATA_TO_COLVAR USE_OUTPUT_DATA_FROMuse the output of the analysis performed by this object as input to your new analysis
object =__FILL__ ARGthe input for this action is the scalar output from one or more other actions. =dat.* FILEcompulsory keyword 
the name of the file to output to =pca_data 
# These next three commands calculate the secondary structure variables. These
# variables measure how much of the structure resembles an alpha helix, an antiparallel beta sheet
# and a parallel beta sheet. Configurations that have different secondary structures should be projected
# in different parts of the low dimensional space.
alpha: ALPHARMSD RESIDUESthis command is used to specify the set of residues that could conceivably form part
of the secondary structure. =all 
abeta: ANTIBETARMSD RESIDUESthis command is used to specify the set of residues that could conceivably form part
of the secondary structure. =all STRANDS_CUTOFFIf in a segment of protein the two strands are further apart then the calculation
of the actual RMSD is skipped as the structure is very far from being beta-sheet
like. =1.0 
pbeta: PARABETARMSD RESIDUESthis command is used to specify the set of residues that could conceivably form part
of the secondary structure. =all STRANDS_CUTOFFIf in a segment of protein the two strands are further apart then the calculation
of the actual RMSD is skipped as the structure is very far from being beta-sheet
like. =1.0 
# These commands collect and output the secondary structure variables so that we can use this information to
# determine how good our projection of the trajectory data is.
cc2: COLLECT_FRAMES ARGthe input for this action is the scalar output from one or more other actions. =alpha,abeta,pbeta 
OUTPUT_ANALYSIS_DATA_TO_COLVAR USE_OUTPUT_DATA_FROMuse the output of the analysis performed by this object as input to your new analysis
object =cc2 ARGthe input for this action is the scalar output from one or more other actions. =cc2.* FILEcompulsory keyword 
the name of the file to output to =secondary_structure_data 

To generate the projection you run the command:

plumed driver --mf_dcd traj.dcd

I would recommend visualizing this data using the GISMO plugin to VMD. You can find instructions on how to compile this code on the page below:

http://epfl-cosmo.github.io/sketchmap/index.html?page=code

(you don't need to compile the sketch-map code) Once GISMO is installed you should have an option to open it when you open vmd. The option to open GISMO can be found under Extensions>Analysis>GISMO. To visualize the results from what we have just done you should need to follow the following instructions:

  • Open vmd and load the pdb file that was output: traj.pdb
  • Open GISMO and load the pca projections file: pca_data
  • Open GISMO and load the secondary structure variables: secondary_structure_data
  • You can safely ignore the error message that GISMO will give at this stage.
  • Now choose to plot the quantities dat.coord-1 and dat.coord-2 on the x and y axis respectively. Color the points using cc2.alpha.

If you follow the instructions above you should get an image like the one shown below:

Figure created using GISMO that shows where each frame of the trajectory is projected in the low-dimensional space. Points are colored in accordance with the alpha helical content of the structure.

You can click on the various points in the plot and VMD will show you the structure in the corresponding trajectory frame. Furthermore, you can get a particularly useful representation of the structures by adding the following text to your ~/.vmdrc file:

user add key m {
  puts "Automatic update of secondary structure, and alignment to first frame"
  trace variable vmd_frame w structure_trace
  rmsdtt
  rmsdtt::doAlign
  destroy $::rmsdtt::w
  clear_reps top
  mol color Structure
  mol selection backbone
  mol representation NewCartoon
  mol addrep top
}

With this text in your ~/.vmdrc file VMD will align all the structures with the first frame and then show the cartoon representation of each structure when you press the m button on your keyboard

Performing MDS

In the previous section we performed PCA on the atomic positions directly. In the section before last, however, we also saw how we can store high-dimensional vectors of collective variables and then use these vectors as input to a dimensionality reduction algorithm. We might legitimately ask, therefore, if we can do PCA using these high-dimensional vectors as input rather than atomic positions. The answer to this question is yes as long as the CV is not periodic. If any of our CVs are not periodic we cannot analyze them using the PCA action. We can, however, formulate the PCA algorithm in a different way. In this alternative formulation, which is known as classical multidimensional scaling (MDS) we do the following:

  • We calculate the matrix of distances between configurations
  • We perform an operation known as centering the matrix.
  • We diagonalize the centered matrix
  • The eigenvectors multiplied by the square root of the corresponding eigenvalue can then be used as a set of projections for our input points.

This method is used less often the PCA as the matrix that we have to diagonalize here in the third step can be considerably larger than the matrix that we have to diagonalize when we perform PCA. In fact in order to avoid this expensive diagonalization step we often select a subset of so called landmark points on which to run the algorithm directly. Projections for the remaining points are then found by using a so-called out-of-sample procedure. This is what has been done in the following input:

Click on the labels of the actions for more information on what each action computes
tested on v2.9
THIS reads could not find this keyword  in could not find this keyword  the could not find this keyword  template could not find this keyword  pdb could not find this keyword  file could not find this keyword  and could not find this keyword  thus could not find this keyword  allows could not find this keyword  us could not find this keyword  to could not find this keyword  use could not find this keyword  the could not find this keyword  @nonhydrogens could not find this keyword  
# special group later in the input
MOLINFO STRUCTUREcompulsory keyword 
a file in pdb format containing a reference structure. =beta-hairpin.pdb MOLTYPEcompulsory keyword ( default=protein )
what kind of molecule is contained in the pdb file - usually not needed since protein/RNA/DNA
are compatible =protein 
# This stores the positions of all the nonhydrogen atoms for later analysis
cc: COLLECT_FRAMES ATOMSthe atoms whose positions we are tracking for the purpose of analyzing the data =@nonhydrogens 
# This should output the atomic positions for the frames that were collected and analyzed using MDS
OUTPUT_ANALYSIS_DATA_TO_PDB USE_OUTPUT_DATA_FROMuse the output of the analysis performed by this object as input to your new analysis
object =cc FILEcompulsory keyword 
the name of the file to output to =traj.pdb 
# The following commands compute all the Ramachandran angles of the protein for you
r2-phi: TORSION ATOMSthe four atoms involved in the torsional angle =@phi-2 
r2-psi: TORSION ATOMSthe four atoms involved in the torsional angle =@psi-2 
r3-phi: TORSION ATOMSthe four atoms involved in the torsional angle =@phi-3 
r3-psi: TORSION ATOMSthe four atoms involved in the torsional angle =@psi-3 
r4-phi: TORSION __FILL__
r4-psi: TORSION __FILL__
r5-phi: TORSION __FILL__
r5-psi: TORSION __FILL__
r6-phi: TORSION __FILL__
r6-psi: TORSION __FILL__
r7-phi: TORSION __FILL__
r7-psi: TORSION __FILL__
r8-phi: TORSION __FILL__
r8-psi: TORSION __FILL__
r9-phi: TORSION __FILL__
r9-psi: TORSION __FILL__
r10-phi: TORSION __FILL__
r10-psi: TORSION __FILL__
r11-phi: TORSION __FILL__
r11-psi: TORSION __FILL__
r12-phi: TORSION __FILL__
r12-psi: TORSION __FILL__
r13-phi: TORSION __FILL__
r13-psi: TORSION __FILL__
r14-phi: TORSION __FILL__
r14-psi: TORSION __FILL__
r15-phi: TORSION __FILL__
r15-psi: TORSION __FILL__
r16-phi: TORSION __FILL__
r16-psi: TORSION __FILL__
# This command stores all the Ramachandran angles that were computed
angles: COLLECT_FRAMES __FILL__ could not find this keyword =r2-phi,r2-psi,r3-phi,r3-psi,r4-phi,r4-psi,r5-phi,r5-psi,r6-phi,r6-psi,r7-phi,r7-psi,r8-phi,r8-psi,r9-phi,r9-psi,r10-phi,r10-psi,r11-phi,r11-psi,r12-phi,r12-psi,r13-phi,r13-psi,r14-phi,r14-psi,r15-phi,r15-psi,r16-phi,r16-psi 
# Lets now compute the matrix of distances between the frames in the space of the Ramachandran angles
distmat: EUCLIDEAN_DISSIMILARITIES USE_OUTPUT_DATA_FROMuse the output of the analysis performed by this object as input to your new analysis
object =__FILL__ METRICcompulsory keyword ( default=EUCLIDEAN )
the method that you are going to use to measure the distances between points =EUCLIDEAN 
# Now select 500 landmark points to analyze
fps: LANDMARK_SELECT_FPS USE_OUTPUT_DATA_FROMuse the output of the analysis performed by this object as input to your new analysis
object =__FILL__ NLANDMARKScompulsory keyword 
the number of landmarks that you would like to select =500 
# Run MDS on the landmarks
mds: CLASSICAL_MDS __FILL__ could not find this keyword =fps NLOW_DIMcompulsory keyword 
number of low-dimensional coordinates required =2 
# Project the remaining trajectory data
osample: PROJECT_ALL_ANALYSIS_DATA USE_OUTPUT_DATA_FROMuse the output of the analysis performed by this object as input to your new analysis
object =__FILL__ PROJECTIONcompulsory keyword 
the projection that you wish to generate out-of-sample projections with =__FILL__ 
# This command outputs all the projections of all the points in the low dimensional space
OUTPUT_ANALYSIS_DATA_TO_COLVAR USE_OUTPUT_DATA_FROMuse the output of the analysis performed by this object as input to your new analysis
object =__FILL__ ARGthe input for this action is the scalar output from one or more other actions. =osample.* FILEcompulsory keyword 
the name of the file to output to =mds_data 
# These next three commands calculate the secondary structure variables. These
# variables measure how much of the structure resembles an alpha helix, an antiparallel beta sheet
# and a parallel beta sheet. Configurations that have different secondary structures should be projected
# in different parts of the low dimensional space.
alpha: ALPHARMSD RESIDUESthis command is used to specify the set of residues that could conceivably form part
of the secondary structure. =all 
abeta: ANTIBETARMSD RESIDUESthis command is used to specify the set of residues that could conceivably form part
of the secondary structure. =all STRANDS_CUTOFFIf in a segment of protein the two strands are further apart then the calculation
of the actual RMSD is skipped as the structure is very far from being beta-sheet
like. =1.0 
pbeta: PARABETARMSD RESIDUESthis command is used to specify the set of residues that could conceivably form part
of the secondary structure. =all STRANDS_CUTOFFIf in a segment of protein the two strands are further apart then the calculation
of the actual RMSD is skipped as the structure is very far from being beta-sheet
like. =1.0 
# These commands collect and output the secondary structure variables so that we can use this information to
# determine how good our projection of the trajectory data is.
cc2: COLLECT_FRAMES ARGthe input for this action is the scalar output from one or more other actions. =alpha,abeta,pbeta 
OUTPUT_ANALYSIS_DATA_TO_COLVAR USE_OUTPUT_DATA_FROMuse the output of the analysis performed by this object as input to your new analysis
object =cc2 ARGthe input for this action is the scalar output from one or more other actions. =cc2.* FILEcompulsory keyword 
the name of the file to output to =secondary_structure_data 

This input collects all the torsional angles for the configurations in the trajectory. Then, at the end of the calculation, the matrix of distances between these points is computed and a set of landmark points is selected using a method known as farthest point sampling. A matrix that contains only those distances between the landmarks is then constructed and diagonalized by the CLASSICAL_MDS action so that projections of the landmarks can be constructed. The final step is then to project the remainder of the trajectory using the PROJECT_ALL_ANALYSIS_DATA action. Try to fill in the blanks in the input above and run this calculation now using the command:

plumed driver --mf_dcd traj.dcd

Once the calculation has completed you can, once again, visualize the data generated using the GISMO plugin.

Performing sketch-map

The two algorithms (PCA and MDS) that we have looked at thus far are both linear dimensionality reduction algorithms. In addition to these there are a whole class of non-linear dimensionality reduction reduction algorithms which work by transforming the matrix of dissimilarities between configurations, calculating geodesic rather than Euclidean distances between configurations or by changing the form of the loss function that is optimized. In this final exercise we are going to use an algorithm that uses the last of the these three strategies to construct a non-linear projection. The algorithm is known as sketch-map and an input for sketch-map is provided below:

Click on the labels of the actions for more information on what each action computes
tested on v2.9
# This reads in the template pdb file and thus allows us to use the @nonhydrogens
# special group later in the input
MOLINFO STRUCTUREcompulsory keyword 
a file in pdb format containing a reference structure. =__FILL__ MOLTYPEcompulsory keyword ( default=protein )
what kind of molecule is contained in the pdb file - usually not needed since protein/RNA/DNA
are compatible =protein 
# This stores the positions of all the nonhydrogen atoms for later analysis
cc: COLLECT_FRAMES __FILL__ could not find this keyword =@nonhydrogens 
# This should output the atomic positions for the frames that were collected and analyzed using MDS
OUTPUT_ANALYSIS_DATA_TO_PDB USE_OUTPUT_DATA_FROMuse the output of the analysis performed by this object as input to your new analysis
object =__FILL__ FILEcompulsory keyword 
the name of the file to output to =traj.pdb 
# The following commands compute all the Ramachandran angles of the protein for you
r2-phi: TORSION ATOMSthe four atoms involved in the torsional angle =@phi-2 
r2-psi: TORSION ATOMSthe four atoms involved in the torsional angle =@psi-2 
r3-phi: TORSION ATOMSthe four atoms involved in the torsional angle =@phi-3 
r3-psi: TORSION ATOMSthe four atoms involved in the torsional angle =@psi-3 
r4-phi: TORSION __FILL__
r4-psi: TORSION __FILL__
r5-phi: TORSION __FILL__
r5-psi: TORSION __FILL__
r6-phi: TORSION __FILL__
r6-psi: TORSION __FILL__
r7-phi: TORSION __FILL__
r7-psi: TORSION __FILL__
r8-phi: TORSION __FILL__
r8-psi: TORSION __FILL__
r9-phi: TORSION __FILL__
r9-psi: TORSION __FILL__
r10-phi: TORSION __FILL__
r10-psi: TORSION __FILL__
r11-phi: TORSION __FILL__
r11-psi: TORSION __FILL__
r12-phi: TORSION __FILL__
r12-psi: TORSION __FILL__
r13-phi: TORSION __FILL__
r13-psi: TORSION __FILL__
r14-phi: TORSION __FILL__
r14-psi: TORSION __FILL__
r15-phi: TORSION __FILL__
r15-psi: TORSION __FILL__
r16-phi: TORSION __FILL__
r16-psi: TORSION __FILL__
# This command stores all the Ramachandran angles that were computed
angles: COLLECT_FRAMES __FILL__ could not find this keyword =r2-phi,r2-psi,r3-phi,r3-psi,r4-phi,r4-psi,r5-phi,r5-psi,r6-phi,r6-psi,r7-phi,r7-psi,r8-phi,r8-psi,r9-phi,r9-psi,r10-phi,r10-psi,r11-phi,r11-psi,r12-phi,r12-psi,r13-phi,r13-psi,r14-phi,r14-psi,r15-phi,r15-psi,r16-phi,r16-psi 
# Lets now compute the matrix of distances between the frames in the space of the Ramachandran angles
distmat: EUCLIDEAN_DISSIMILARITIES USE_OUTPUT_DATA_FROMuse the output of the analysis performed by this object as input to your new analysis
object =__FILL__ METRICcompulsory keyword ( default=EUCLIDEAN )
the method that you are going to use to measure the distances between points =EUCLIDEAN 
# Now select 500 landmark points to analyze
fps: LANDMARK_SELECT_FPS USE_OUTPUT_DATA_FROMuse the output of the analysis performed by this object as input to your new analysis
object =__FILL__ NLANDMARKScompulsory keyword 
the number of landmarks that you would like to select =500 
# Run sketch-map on the landmarks
smap: SKETCH_MAP __FILL__ could not find this keyword =fps NLOW_DIMcompulsory keyword 
the dimension of the low dimensional space in which the projections will be constructed
=2 HIGH_DIM_FUNCTIONcompulsory keyword 
the parameters of the switching function in the high dimensional space ={SMAP R_0=6 A=8 B=2}  LOW_DIM_FUNCTIONcompulsory keyword 
the parameters of the switching function in the low dimensional space ={SMAP R_0=6 A=2 B=2}  CGTOLcompulsory keyword ( default=1E-6 )
the tolerance for the conjugate gradient minimization =1E-3 CGRID_SIZEcompulsory keyword ( default=10 )
number of points to use in each grid direction =20 FGRID_SIZEcompulsory keyword ( default=0 )
interpolate the grid onto this number of points -- only works in 2D =200 ANNEAL_STEPScompulsory keyword ( default=10 )
the number of steps of annealing to do =0 
# Project the remaining trajectory data
osample: PROJECT_ALL_ANALYSIS_DATA USE_OUTPUT_DATA_FROMuse the output of the analysis performed by this object as input to your new analysis
object =__FILL__ PROJECTIONcompulsory keyword 
the projection that you wish to generate out-of-sample projections with =__FILL__ 
# This command outputs all the projections of all the points in the low dimensional space
OUTPUT_ANALYSIS_DATA_TO_COLVAR USE_OUTPUT_DATA_FROMuse the output of the analysis performed by this object as input to your new analysis
object =__FILL__ ARGthe input for this action is the scalar output from one or more other actions. =osample.* FILEcompulsory keyword 
the name of the file to output to =smap_data 
# These next three commands calculate the secondary structure variables. These
# variables measure how much of the structure resembles an alpha helix, an antiparallel beta sheet
# and a parallel beta sheet. Configurations that have different secondary structures should be projected
# in different parts of the low dimensional space.
alpha: ALPHARMSD RESIDUESthis command is used to specify the set of residues that could conceivably form part
of the secondary structure. =all 
abeta: ANTIBETARMSD RESIDUESthis command is used to specify the set of residues that could conceivably form part
of the secondary structure. =all STRANDS_CUTOFFIf in a segment of protein the two strands are further apart then the calculation
of the actual RMSD is skipped as the structure is very far from being beta-sheet
like. =1.0 
pbeta: PARABETARMSD RESIDUESthis command is used to specify the set of residues that could conceivably form part
of the secondary structure. =all STRANDS_CUTOFFIf in a segment of protein the two strands are further apart then the calculation
of the actual RMSD is skipped as the structure is very far from being beta-sheet
like. =1.0 
# These commands collect and output the secondary structure variables so that we can use this information to
# determine how good our projection of the trajectory data is.
cc2: COLLECT_FRAMES ARGthe input for this action is the scalar output from one or more other actions. =alpha,abeta,pbeta 
OUTPUT_ANALYSIS_DATA_TO_COLVAR USE_OUTPUT_DATA_FROMuse the output of the analysis performed by this object as input to your new analysis
object =cc2 ARGthe input for this action is the scalar output from one or more other actions. =cc2.* FILEcompulsory keyword 
the name of the file to output to =secondary_structure_data 

This input collects all the torsional angles for the configurations in the trajectory. Then, at the end of the calculation, the matrix of distances between these points is computed and a set of landmark points is selected using a method known as farthest point sampling. A matrix that contains only those distances between the landmarks is then constructed and diagonalized by the CLASSICAL_MDS action and this set of projections is used as the initial configuration for the various minimization algorithms that are then used to optimize the sketch-map stress function. As in the previous exercise once the projections of the landmarks are found the projections for the remainder of the points in the trajectory are found by using the PROJECT_ALL_ANALYSIS_DATA action. Try to fill in the blanks in the input above and run this calculation now using the command:

plumed driver --mf_dcd traj.dcd

Once the calculation has completed you can, once again, visualize the data generated using the GISMO plugin.

Conclusions and extensions

This tutorial shown you that running dimensionality reduction algorithms using PLUMED involves the following stages:

There are multiple choices to be made in each of the various stages described above. For example, you can change the particular sort of data this is collected from the trajectory, there are multiple different ways to select landmarks, you can use the distances directly or you can transform them, you can use various different loss function and you can optimize the loss function using a variety of different algorithms. In this final exercise of the tutorial I thus want you to experiment with these various different choices that can be made. Use the data set that we have been working with throughout this tutorial and try to construct an interesting representation of it using some combination of Actions that we have not explored in the tutorial. Some things you can perhaps try:

  • Try sketch-map with RMSD distances as input rather than angles
  • Try using different Landmark Selection algorithms
  • Try using different numbers of landmarks
  • Try to use PCA followed by sketch-map
  • See if you can work out how to draw contour plot showing the free energy as a function of the low-dimensional coordinates.