Classification and Search

HIV-Capsid shape search: Preliminary results

Context

In 1999, Barbie Ganser et al. discovered the HIV-1 virus capsid to have fullerene structure A decade later, using our programs and applying rules of fullerene structure by hand, Zhao et al. found a couple of fullerene structures (two C₃₉₂ and C₄₅₂ nanocones) that were good fits for HIV capsid shapes discovered by cryo-ET microscopy, and were able to simulate an all-atom model by molecular dynamics. Ganser and Pornillos modeled a HIV-1 capsid as a C₃₅₂ structure. A later study by Mattei et al (2017) performed high-resolution cryo-tomography on 539 live HIV vira, showing that HIV capsids form a variety of fullerene nanocones. In particular, they propose specific C₃₈₄ and C₄₃₈ structures to match two of the tomograms.

There are 40,806,395,661 disctinct C₃₅₂-fullerenes, 109,837,310,021 distinct C₃₉₂ structures, and more than 2×10¹¹ C₄₃₈. Are the proposed structures for the HIV capsid really the correct ones, out of many millions of candidates, or merely similar? The published capsid structures were all constructed using rules of thumb and fitting. An exhaustive search for the actual structures has not been attempted before.

This project will use our programs to exhaustively generate all fullerene structures of the relevant sizes, and out of the hundreds of billions of structures, rank them by closeness to the experimental data. The goal is to find the set of exact structures of the observed HIV capsids. While the previously proposed structures are certainly similar in shape, the differences in structure to the exact ones may not be unimportant: The exact structures carry information about the rules that can lead to their construction. With a large selection of exact structures, we can then investigate:

How are the HIV capsid structures related to each other? What are the invariants, and what differences will still lead to a viable virus?
Can we use this information to infer rules for how the virus capsids are constructed?
Can we discover information from this about the function of the live virus: entry, opening/uncoating of the capsid, and delivery of the genetic payload?
Can schemes be devised that will perturb the construction in a simple manner, to make the vira build faulty capsids that will not for valid fullerenes and hence close up properly? Can common structural invariants be exploited to devise attacks on live vira?

Preliminary Work

Trial Data: What Structures Were Reported Found By Mattei et. al, 2017?

A later study by Mattei et al (2017) performed high-resolution cryo-tomography on 539 live HIV vira, showing that HIV capsids form a variety of fullerene nanocones.

Fig 1.1 Mattei et al., 2017: Slice through a representative tomographic reconstruction of HIV-1 virions. Scale bar is 100nm.

In particular, they propose specific C₃₈₄ and C₄₃₈ structures to match two of the tomograms: One C₄₃₈-structure (comprising 1314 CA protein structures), and one C₄₁₄-structure (comprising 1152 CA protein structures). They attempted to place hexagons and pentagons by numerical heuristics described in the paper, and Fig. 1.2 below (Fig 4 in their paper). They describe the method as follows:

(A) Computational slices through the tomographic reconstructions of four HIV-1 par- ticles, two with complete conical cores [...] superimposed with the position and orientation of each aligned hexameric and pentameric unit, revealing the fullerene structure. The CTDs are colored gray; the NTDs of the hexameric units are colored according to the quality of their alignment from red (for low cross-correlation values) to green (for high cross-correlation values); the NTDs of the pentameric units are depicted in blue. [...]

Fig 1.2 Mattei et al., 2017: (left) Structure (A): A C₄₃₈ structure (1314 CA protein structures). (right) Structure (B): A C₃₆₀ structure (1152 CA protein structures).

Since the sharp bends in the surface occurs only at pentagon positions, these and their relative distances are well-determined from the tomographic data. However, the full structure is not.

Methodology

The entire isomer space of C₄₃₈, consisting of approx 3×10¹¹ distinct structures (and their enantiomers for chiral structures) were generated and ranked in closeness by the methods described below: First by a coarse-grained filtering according to the Eisenstein pattern of the top cap (which Mattei et. al report with highest confidence, marking every face as bright green). Classifying the entire isomer space required about two months of computation time on 120 CPU cores(AMD Opteron 6272, 2.1GHz). About 4 million structures matched the 5-pentagon cap pattern.

The second-highest confidence was reported for the bottom cap. A second fine-detail search ranked all the 4 million remaining structures according to the inter-pentagon surface distances within the bottom cap (i.e., not exact paths, but pair-wise relative distances, a more flexible measure), as well as the cone length. The fine-grained ranking completed in less than two hours on a consumer laptop with one Intel i7-6600U 2.6GHz CPU.

Of all 3×10¹¹ fullerenes with the correct hexagon and pentagon count, none exactly matched the structures proposed in the Science paper by Mattei et al.; I.e., the proposed structure was not valid. However, 89 nearest-neighbour fullerene structures were found, exactly matching the reported 5-pentagon top cap, and at $d = 0.0500313$ away from the reported structure for the bottom cap in the metric described below. The 10,000 nearest structures were ranked, ranging from $d=0.0500313$ to $d=0.72219$. These can be downloaded here: HIV-C438-10k-best.txt.

Further analysis will be performed both on the 4 million structures matching the top cap, and of the 10,000 nearest structures.

Coarse-grained sorting: Fast fullerene pattern-matching

The first step is to generate the entire isomer space of 3×10¹¹ distinct C₄₃₈ graphs, and filter by matching paths in the Eisenstein plane that stay entirely within green regions in the top cap shown in Figure 1.2(A) (making sure to check both chiralities). Enumerating the pentagons in clockwise order, starting from the leftmost one, Pentagon 1 is connected to the four others in the cap as follows:

pattern_t HIV_C438_top {{ {/*1-1*/{},
    /*1-2*/{{0,1},{2,1}}, /*1-3*/{{0,2},{2,1}}, /*1-4*/{{0,2},{4,1}},
    /*1-5*/{{0,4},{5,3}}} }};

Here, the paths are represented as lists of pairs, the first member of which is the direction (0-5, the number of rotations from "straight ahead" in ascending counter-clockwise order), and the second is the number of steps to take.

When matching, each pentagon can potentially be "Pentagon 1", and each direction can potentially be "Direction 0", but once a pentagon and direction is chosen, the rest is fixed. Figure 2.1 shows the code used to match the path pattern.

bool match_pattern(const Triangulation& G, const pattern_t &P)
{
  for(int u=0;u<G.N;u++)
   if(G.neighbours[u].size()==5){ // Each pentagon can potentially be Pentagon 1
     for(int i=0;i<5;i++){	
      vector<int> pentagons(12,-1);

      // Could this be Pentagon 1?
      bool all_match = true;
      for(int k=0;k<P.size();k++)
	for(int j=1;j<P[k].size();j++) if(!P[k][j].empty()) {
	    // Placing Pentagon j.
	    node_t v = follow_path(G,u,i,P[k][j]);
	  
	    if(G.neighbours[v].size()!=5) // Arrived at non-pentagon node v, pattern mismatch
	      all_match = false; break;

	    if(pentagons[j] == -1 || pentagons[j] == v) // SUCCESS! Reached unseen pentagon 
	      pentagons[j] = v;
            else                    // Arrived at previously seen pentagon, pattern mismatch 
	      all_match = false; break;

	  }
      if(all_match)
	return true;
    }
  }
  
  return false;
}

Fig. 2.1: Fast Fullerene Pattern-matching search through Eisenstein-path traversals

The code for traversing a path in the Eisenstein plane is shown in Figure 2.2.

node_t follow_path(const Triangulation& G, node_t u, const int i, const path_t &path)
{
  enum {WEST,NORTHWEST,NORTHEAST,EAST,SOUTHEAST,SOUTHWEST}; 
  if(path.empty()) return u;

  // First follow the first step along the path
  node_t v = G.neighbours[u][(path[0].first+i) % G.degree(u)];
  for(int k=1;k<path[0].second;k++){
    int w = incident_edge(G,u,v);
    u = v;
    v = G.neighbours[u][(w+EAST) % G.degree(u)];
  }

  // Next, follow each remaining step in the path
  for(int j=1;j<path.size();j++){
    int w = incident_edge(G,u,v);
    u = v;
    v = G.neighbours[u][(w+path[j].first) % G.degree(u)];

    for(int k=1;k<path[j].second;k++){
      int w = incident_edge(G,u,v);
      u = v;
      v = G.neighbours[v][(w+EAST) % G.degree(v)];
    }
  }

  return v;
}

Fig. 2.2: The Eisenstein-path traversal: Follow $path$ from pentagon $u$ through its $i$'th edge.

Fine-grained sorting: Exact surface-distance ranking

The coarse grained filtering required in the order of microseconds per isomer test. Once the isomer space has been reduced from 3×10¹¹ down to 4×10⁶, we can use more computationally intensive methods to analyze the remaining candidates. The method described here used ˜2 miliseconds per rated structure on a 2015 laptop to rank the closeness of all candidate isomers in about two hours in total.

The second-highest confidence was reported for the bottom cap. This second sorting did not require structures to exactly match the reported structure, but ranked according to the closeness in inter-pentagon surface distances within the bottom cap, as well as surface distance between caps.

Exact fullerene surface distances were calculated using the geodesics algorithm described at carma.kvante.org/geodesics/. The closeness in match for each isomer was then evaluated as the mean squared error of the 7×7 interpentagon distances compared to the bottom pattern (reported at second highest confidence). Figure 3.1 shows the details of how the closeness-metric was calculated. The 10,000 closest matches were maintained using a "$k$-smallest" data structure implemented with a heap-based priority queue in $\mathcal{O}\left(n\log(k))$ time and constant space.

double surface_sorted_distance(const Triangulation &G){
  // Matching the top pattern identifies the first 5 pentagon nodes
  vector matched_pentagons = match_pattern(G,HIV_C438_top);
  if(matched_pentagons.empty()) matched_pentagons = match_pattern(reverse(G),HIV_C438_top);
  if(matched_pentagons.empty()) {
    cerr << "No match for pattern.\n";
    return INT_MAX;
  }

  // Fill out remaining pentagons	    
  vector unmatched_pentagons;
  for(node_t u=0;u Dreduced = G.convex_square_surface_distances(unmatched_pentagons); 

  double min_err = INT_MAX;
  int min_idx = -1;
  for(int i=0;i<7;i++){
    vector Di = Dreduced(i);
    sort(Di.begin(),Di.end());
    const int *d = HIV_C438_bottom1_distance2_sorted[0];

    double error = 0;
    for(int j=0;j<7;j++) error += (sqrt(Di[j])-sqrt(d[j]))*(sqrt(Di[j])-sqrt(d[j]));

    if(error < min_err){
      min_err = error;
      min_idx = i;
    }
  }
 
  return min_err;
}

 Fig 3.1: Mean-squared error of inter-pentagon surface distances for fullerene cone bottoms compared to reported cone bottom pattern.

Preliminary Work: Results

As described above, no fullerene isomer could exactly matched the proposed structure, showing that it is not a valid fullerene, and probably incorrect, but many structures were extremely close: 95 tied for closest, with MSE 0.05, and 157 tied for second-closest with MSE 0.07. The 10,000 nearest structures were ranked, ranging from $d=0.0500313$ to $d=0.72219$. These can be downloaded here: HIV-C438-10k-best.txt.

Fig. 3: One of 77 nearest solutions, [1,2,12,15,25,138,181,186,191,198,206,210]-C438-fullerene with $d=0.0500313$ (drag to rotate)

The next step

To decide which is the actual correct structure, I will need access to the actual tomograms, and further analysis can be performed, both on the 4M crude matches, and even more expensive analyses can be applied to the 10k near-matches. I have contacted Mattei et al. to initiate a collaboration that allows me to analyse their 539 tomograms.

This would make it possible to find the exact fullerene structures matching the 539 HIV vira capsids, giving a more solid foundation for the study of how HIV vira form, the regularities and differences in their structures, what causes construction to succeed, what causes it to fail, and untimately helping design of drugs that are successful in destroying HIV vira or inhibiting their successful formation.

There are ~2.3×10¹³ fullerene structures between C₂₀₀ and C₅₀₀ -- the size range within which HIV capsids lie. The first step will be to go through this enormous isomer space and extract all the nanocones, which can be potential HIV structures. However, the number of structures is too large for the methods used in this pilot study: With a time of around two miliseconds to generate and analyse each structure, and 3.15×10⁷ seconds in a year, the present analysis needs about 1500 CPU years. I am proposing to implement the isomer generation and coarse-grained filter for nanocones as a fully pipelined FPGA using the SME (Synchronous Message Exchange) framework, developed in our group. The goal is a clock-rate of about 1MHz, meaning a process rate of $1\mu s$ per structure, and 10-50 units, in order to bring total computation time down to the region of one week to one month in total.

Once the database of all valid nanocones have been computed, many different analyses can be applied to this much smaller number of structures. It will be published independently, as it can have many uses beyond the present investigation.