Hank Childs ran the connected components filter in March of 2010 on a hero-sized run and looked into bottlenecks and possible solutions. This page captures some thoughts in case this issue is studied again in the future.
The run was a 64 billion rectilinear grid, done on 256 processors of longhorn, the vis cluster at UT Austin. The components were taken on an isosurface of dissipation. The isosurface contained approximately one billion triangles.
NOTE: these runs prompted several improvements to the existing code in early March 2010. The timings on this page are _after_ those changes. Also, note that some timings statements are not committed in the code (but they should be self explanatory).
Single Set Connected Components Labeling took 3.022924 Global Label Shift took 5.667175 Creating spatial partition took 26.855960 Barrier after creating spatial partition took 0.004952 Relocating using partition took 26.086254 Multi-Set Component Label List took 1.385073 Barrier before global union took 1.704536 Global Label Union took 1.440838 Barrier after Global Label Union took 0.238747 Global Label Resolve took 57.721124 converting ugrids to polydata in postex took 0.000008 Breaking pipeline connections in postex took 0.000108 avtConnComponentsExpression took 71.973077
Note that there is some unaccounted for time (72s - 3 - 5.7- 57.7 = 5.6s). IMO, the two most biggest targets for improvement are the creation of the spatial partition and the relocation. Also, note that the CMFE code shares this code base and also could benefit from such changes.
Creating the spatial partition
The algorithm recursively subdivides the problem's bounding box until it had partitioned space such that (i) there is one bounding box for every MPI task and (ii) each bounding box has approximately the same number of cells.
This code follows a stupid pattern. Assume you have 5 million points on a given MPI task. For a given iteration, it iterates over all 5 million points. It asks, for each point, what partitions that point falls in. Then it updates those partitions (may be just one). What's dumb? It doesn't need to iterate over single points. Rather, it can group points and update the boundaries based on the grouping. So imagine we had a data structure that had all 5 million points. Ideally, we could just ask "how many of your points fall in this region?" We would ask this for each of the partitions. If the data structure could answer this efficiently, it would save us "5 million points" times "Number of partitions (~500)" work.
There are several phases to the relocation:
- Set up the meshes ... meaning divide up each of this MPI task's meshes to the portions that go to other processors
- Set up appenders ... meaning to combine the meshes and set up a single unsigned char string message for each processor
- Exchange info ... meaning to make MPI calls
- Create output meshes ... meaing to unserialize the meshes and make output we could use.
Setting up meshes to send took 15.116134 Setting up appenders took 6.348618 Exchanging info took 1.260384 Creating 65 meshes with 4326621 cells. took 3.721318 Relocating using partition took 26.610797
(Note these timings come from the slowest processor, MPI task 72 ... the others appeared to have a long "exchanging info" time because they are waiting on this processor)
The longest time is "Setting up meshes". Note that this routine uses a lot of slow VTK routines, like "InsertNextCell". (This routine uses realloc and exhibits an MVAPICH bug, by the way.) Ideally, we would take one pass through data to calculate sizes and then a second pass to efficiently generate output.
I suspect efficiencies could be applied to the other phases as well.