In 1999, Barbie Ganser et al. discovered the HIV-1 virus capsid to have fullerene structure A decade later, using our programs and applying rules of fullerene structure by hand, Zhao et al. found a couple of fullerene structures (two C392 and C452 nanocones) that were good fits for HIV capsid shapes discovered by cryo-ET microscopy, and were able to simulate an all-atom model by molecular dynamics. Ganser and Pornillos modeled a HIV-1 capsid as a C352 structure. A later study by Mattei et al (2017) performed high-resolution cryo-tomography on 539 live HIV vira, showing that HIV capsids form a variety of fullerene nanocones. In particular, they propose specific C384 and C438 structures to match two of the tomograms.
There are 40,806,395,661 disctinct C352-fullerenes, 109,837,310,021 distinct C392 structures, and more than 2×1011 C438. Are the proposed structures for the HIV capsid really the correct ones, out of many millions of candidates, or merely similar? The published capsid structures were all constructed using rules of thumb and fitting. An exhaustive search for the actual structures has not been attempted before.
This project will use our programs to exhaustively generate all fullerene structures of the relevant sizes, and out of the hundreds of billions of structures, rank them by closeness to the experimental data. The goal is to find the set of exact structures of the observed HIV capsids. While the previously proposed structures are certainly similar in shape, the differences in structure to the exact ones may not be unimportant: The exact structures carry information about the rules that can lead to their construction. With a large selection of exact structures, we can then investigate:
(A) Computational slices through the tomographic reconstructions of four HIV-1 par- ticles, two with complete conical cores [...] superimposed with the position and orientation of each aligned hexameric and pentameric unit, revealing the fullerene structure. The CTDs are colored gray; the NTDs of the hexameric units are colored according to the quality of their alignment from red (for low cross-correlation values) to green (for high cross-correlation values); the NTDs of the pentameric units are depicted in blue. [...]
The entire isomer space of C438, consisting of approx 3×1011 distinct structures (and their enantiomers for chiral structures) were generated and ranked in closeness by the methods described below: First by a coarse-grained filtering according to the Eisenstein pattern of the top cap (which Mattei et. al report with highest confidence, marking every face as bright green). Classifying the entire isomer space required about two months of computation time on 120 CPU cores(AMD Opteron 6272, 2.1GHz). About 4 million structures matched the 5-pentagon cap pattern.
The second-highest confidence was reported for the bottom cap. A second fine-detail search ranked all the 4 million remaining structures according to the inter-pentagon surface distances within the bottom cap (i.e., not exact paths, but pair-wise relative distances, a more flexible measure), as well as the cone length. The fine-grained ranking completed in less than two hours on a consumer laptop with one Intel i7-6600U 2.6GHz CPU.
Of all 3×1011 fullerenes with the correct hexagon and pentagon count, none exactly matched the structures proposed in the Science paper by Mattei et al.; I.e., the proposed structure was not valid. However, 89 nearest-neighbour fullerene structures were found, exactly matching the reported 5-pentagon top cap, and at $d = 0.0500313$ away from the reported structure for the bottom cap in the metric described below. The 10,000 nearest structures were ranked, ranging from $d=0.0500313$ to $d=0.72219$. These can be downloaded here: HIV-C438-10k-best.txt.
Further analysis will be performed both on the 4 million structures matching the top cap, and of the 10,000 nearest structures.
The first step is to generate the entire isomer space of 3×1011 distinct C438 graphs, and filter by matching paths in the Eisenstein plane that stay entirely within green regions in the top cap shown in Figure 1.2(A) (making sure to check both chiralities). Enumerating the pentagons in clockwise order, starting from the leftmost one, Pentagon 1 is connected to the four others in the cap as follows:
pattern_t HIV_C438_top {{ {/*1-1*/{},
/*1-2*/{{0,1},{2,1}}, /*1-3*/{{0,2},{2,1}}, /*1-4*/{{0,2},{4,1}},
/*1-5*/{{0,4},{5,3}}} }};
Here, the paths are
represented as lists of pairs, the first member of which is the
direction (0-5, the number of rotations from "straight ahead" in
ascending counter-clockwise order), and the second is the number of
steps to take.
When matching, each pentagon can potentially be "Pentagon 1", and each direction can potentially be "Direction 0",
but once a pentagon and direction is chosen, the rest is fixed.
Figure 2.1 shows the code used to match the path pattern.
bool match_pattern(const Triangulation& G, const pattern_t &P)
{
for(int u=0;u<G.N;u++)
if(G.neighbours[u].size()==5){ // Each pentagon can potentially be Pentagon 1
for(int i=0;i<5;i++){
vector<int> pentagons(12,-1);
// Could this be Pentagon 1?
bool all_match = true;
for(int k=0;k<P.size();k++)
for(int j=1;j<P[k].size();j++) if(!P[k][j].empty()) {
// Placing Pentagon j.
node_t v = follow_path(G,u,i,P[k][j]);
if(G.neighbours[v].size()!=5) // Arrived at non-pentagon node v, pattern mismatch
all_match = false; break;
if(pentagons[j] == -1 || pentagons[j] == v) // SUCCESS! Reached unseen pentagon
pentagons[j] = v;
else // Arrived at previously seen pentagon, pattern mismatch
all_match = false; break;
}
if(all_match)
return true;
}
}
return false;
}
The code for traversing a path in the Eisenstein plane is shown in Figure 2.2.
node_t follow_path(const Triangulation& G, node_t u, const int i, const path_t &path)
{
enum {WEST,NORTHWEST,NORTHEAST,EAST,SOUTHEAST,SOUTHWEST};
if(path.empty()) return u;
// First follow the first step along the path
node_t v = G.neighbours[u][(path[0].first+i) % G.degree(u)];
for(int k=1;k<path[0].second;k++){
int w = incident_edge(G,u,v);
u = v;
v = G.neighbours[u][(w+EAST) % G.degree(u)];
}
// Next, follow each remaining step in the path
for(int j=1;j<path.size();j++){
int w = incident_edge(G,u,v);
u = v;
v = G.neighbours[u][(w+path[j].first) % G.degree(u)];
for(int k=1;k<path[j].second;k++){
int w = incident_edge(G,u,v);
u = v;
v = G.neighbours[v][(w+EAST) % G.degree(v)];
}
}
return v;
}
The coarse grained filtering required in the order of microseconds per isomer test. Once the isomer space has been reduced from 3×1011 down to 4×106, we can use more computationally intensive methods to analyze the remaining candidates. The method described here used ˜2 miliseconds per rated structure on a 2015 laptop to rank the closeness of all candidate isomers in about two hours in total.
The second-highest confidence was reported for the bottom cap. This second sorting did not require structures to exactly match the reported structure, but ranked according to the closeness in inter-pentagon surface distances within the bottom cap, as well as surface distance between caps.
Exact fullerene surface distances were calculated using the geodesics algorithm described at carma.kvante.org/geodesics/. The closeness in match for each isomer was then evaluated as the mean squared error of the 7×7 interpentagon distances compared to the bottom pattern (reported at second highest confidence). Figure 3.1 shows the details of how the closeness-metric was calculated. The 10,000 closest matches were maintained using a "$k$-smallest" data structure implemented with a heap-based priority queue in $\mathcal{O}\left(n\log(k))$ time and constant space.
double surface_sorted_distance(const Triangulation &G){
// Matching the top pattern identifies the first 5 pentagon nodes
vector matched_pentagons = match_pattern(G,HIV_C438_top);
if(matched_pentagons.empty()) matched_pentagons = match_pattern(reverse(G),HIV_C438_top);
if(matched_pentagons.empty()) {
cerr << "No match for pattern.\n";
return INT_MAX;
}
// Fill out remaining pentagons
vector unmatched_pentagons;
for(node_t u=0;u Dreduced = G.convex_square_surface_distances(unmatched_pentagons);
double min_err = INT_MAX;
int min_idx = -1;
for(int i=0;i<7;i++){
vector Di = Dreduced(i);
sort(Di.begin(),Di.end());
const int *d = HIV_C438_bottom1_distance2_sorted[0];
double error = 0;
for(int j=0;j<7;j++) error += (sqrt(Di[j])-sqrt(d[j]))*(sqrt(Di[j])-sqrt(d[j]));
if(error < min_err){
min_err = error;
min_idx = i;
}
}
return min_err;
}
Fig 3.1: Mean-squared error of inter-pentagon surface distances for fullerene cone bottoms compared to reported cone bottom pattern.
As described above, no fullerene isomer could exactly matched the proposed structure, showing that it is not a valid fullerene, and probably incorrect, but many structures were extremely close: 95 tied for closest, with MSE 0.05, and 157 tied for second-closest with MSE 0.07. The 10,000 nearest structures were ranked, ranging from $d=0.0500313$ to $d=0.72219$. These can be downloaded here: HIV-C438-10k-best.txt.
To decide which is the actual correct structure, I will need access to the actual tomograms, and further analysis can be performed, both on the 4M crude matches, and even more expensive analyses can be applied to the 10k near-matches. I have contacted Mattei et al. to initiate a collaboration that allows me to analyse their 539 tomograms.
This would make it possible to find the exact fullerene structures matching the 539 HIV vira capsids, giving a more solid foundation for the study of how HIV vira form, the regularities and differences in their structures, what causes construction to succeed, what causes it to fail, and untimately helping design of drugs that are successful in destroying HIV vira or inhibiting their successful formation.
There are ~2.3×1013 fullerene structures between C200 and C500 -- the size range within which HIV capsids lie. The first step will be to go through this enormous isomer space and extract all the nanocones, which can be potential HIV structures. However, the number of structures is too large for the methods used in this pilot study: With a time of around two miliseconds to generate and analyse each structure, and 3.15×107 seconds in a year, the present analysis needs about 1500 CPU years. I am proposing to implement the isomer generation and coarse-grained filter for nanocones as a fully pipelined FPGA using the SME (Synchronous Message Exchange) framework, developed in our group. The goal is a clock-rate of about 1MHz, meaning a process rate of $1\mu s$ per structure, and 10-50 units, in order to bring total computation time down to the region of one week to one month in total.
Once the database of all valid nanocones have been computed, many different analyses can be applied to this much smaller number of structures. It will be published independently, as it can have many uses beyond the present investigation.