ConnCompPerf

Hank Childs ran the connected components filter on a hero-sized problem in March of 2010 and looked into bottlenecks and possible solutions. This page captures some thoughts in case the issue is studied again in the future.

Run details

The run operated on a 64 billion cell rectilinear grid, using 256 processors of longhorn, the vis cluster at UT Austin. The components were taken on an isosurface of dissipation; the isosurface contained approximately one billion triangles.

Timings

NOTE: these runs prompted several improvements to the existing code in early March 2010. The timings on this page are _after_ those changes. Also, note that some timing statements are not committed in the code (but they should be self-explanatory).

         Single Set Connected Components Labeling took 3.022924
         Global Label Shift took 5.667175
            Creating spatial partition took 26.855960
            Barrier after creating spatial partition took 0.004952
            Relocating using partition took 26.086254
            Multi-Set Component Label List took 1.385073
            Barrier before global union took 1.704536
            Global Label Union took 1.440838
            Barrier after Global Label Union took 0.238747
         Global Label Resolve took 57.721124
         converting ugrids to polydata in postex took 0.000008
         Breaking pipeline connections in postex took 0.000108
      avtConnComponentsExpression took 71.973077

Note that there is some unaccounted-for time (72s - 3 - 5.7 - 57.7 = 5.6s). IMO, the two biggest targets for improvement are the creation of the spatial partition and the relocation. Also, note that the CMFE code shares this code base and could also benefit from such changes.

Creating the spatial partition

The algorithm recursively subdivides the problem's bounding box until it has partitioned space such that (i) there is one bounding box for every MPI task and (ii) each bounding box contains approximately the same number of cells.

This code follows an inefficient pattern. Assume a given MPI task has 5 million points. On each iteration, the code walks over all 5 million points, asks which candidate partitions each point falls in, and then updates those partitions (there may be just one). The problem is that it does not need to visit individual points at all. Instead, it could group the points once and update the partition boundaries from the grouping. Imagine a data structure holding all 5 million points that could efficiently answer "how many of your points fall in this region?" We would ask that question once for each candidate partition, and if the data structure could answer it efficiently, we would save roughly "5 million points" times "number of partitions (~500)" worth of work.
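One way to get such a data structure is to bin the points once into a uniform grid of counts over the bounding box; a region query then sums the bins it overlaps instead of revisiting every point. The following is only a sketch of that idea, not the existing code; the class name, the fixed 64^3 resolution, and the CountInBox query are illustrative assumptions.

    // Sketch: bin the points once (O(N)); afterward, each axis-aligned
    // region-count query only sums bins (independent of N).
    #include <vector>
    #include <algorithm>

    struct BBox { double min[3], max[3]; };

    class PointCountGrid
    {
    public:
        PointCountGrid(const BBox &b, int r = 64)
          : bounds(b), res(r), counts((long)r*r*r, 0) {}

        // One-time O(N) pass over this task's points.
        void AddPoint(const double p[3])
        {
            counts[Index(Bin(p[0],0), Bin(p[1],1), Bin(p[2],2))]++;
        }

        // Count of points inside an axis-aligned box, summed over the
        // bins that overlap it (approximate at bin granularity).
        long CountInBox(const BBox &b) const
        {
            int lo[3], hi[3];
            for (int d = 0; d < 3; d++)
            {
                lo[d] = Bin(b.min[d], d);
                hi[d] = Bin(b.max[d], d);
            }
            long n = 0;
            for (int i = lo[0]; i <= hi[0]; i++)
                for (int j = lo[1]; j <= hi[1]; j++)
                    for (int k = lo[2]; k <= hi[2]; k++)
                        n += counts[Index(i, j, k)];
            return n;
        }

    private:
        int Bin(double x, int d) const
        {
            double t = (x - bounds.min[d]) / (bounds.max[d] - bounds.min[d]);
            return std::min(res - 1, std::max(0, (int)(t * res)));
        }
        long Index(int i, int j, int k) const
        {
            return ((long)i * res + j) * res + k;
        }

        BBox bounds;
        int res;
        std::vector<long> counts;
    };

Building the grid is one pass over the 5 million points; after that, each candidate partition costs only a sum over bins, so the per-iteration cost no longer scales as "number of points" times "number of partitions". The count is approximate at bin granularity, which should be fine for balancing the partition.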

Relocation

There are several phases to the relocation:

  1. Set up the meshes ... meaning divide each of this MPI task's meshes into the portions that go to other processors
  2. Set up appenders ... meaning combine those portions and build a single unsigned char message for each destination processor
  3. Exchange info ... meaning make the MPI calls (a sketch of this exchange appears after the timings below)
  4. Create output meshes ... meaning deserialize the received meshes and make output we can use.
              Setting up meshes to send took 15.116134
              Setting up appenders took 6.348618
              Exchanging info took 1.260384
              Creating 65 meshes with 4326621 cells. took 3.721318
           Relocating using partition took 26.610797

(Note that these timings come from the slowest processor, MPI task 72; the others appeared to have a long "exchanging info" time only because they were waiting on this processor.)
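For reference, phase 3 (consuming the buffers produced by phase 2) amounts to a standard variable-length all-to-all: exchange the per-destination buffer sizes, then exchange the serialized buffers themselves. Below is a minimal sketch of that pattern, not the actual routine in the code; the function name and the assumption that sendBufs[r] already holds the bytes destined for rank r are illustrative.

    // Sketch: exchange one serialized unsigned char buffer per destination.
    #include <mpi.h>
    #include <vector>
    #include <algorithm>

    std::vector<unsigned char>
    ExchangeSerializedMeshes(const std::vector<std::vector<unsigned char> > &sendBufs,
                             std::vector<int> &recvCounts, std::vector<int> &rdispls,
                             MPI_Comm comm)
    {
        int nTasks;
        MPI_Comm_size(comm, &nTasks);

        // 1. Tell every task how many bytes to expect from us.
        std::vector<int> sendCounts(nTasks), sdispls(nTasks, 0);
        for (int r = 0; r < nTasks; r++)
            sendCounts[r] = (int)sendBufs[r].size();
        recvCounts.resize(nTasks);
        MPI_Alltoall(sendCounts.data(), 1, MPI_INT,
                     recvCounts.data(), 1, MPI_INT, comm);

        // 2. Build displacements and flatten our send buffers.
        rdispls.assign(nTasks, 0);
        for (int r = 1; r < nTasks; r++)
        {
            sdispls[r] = sdispls[r-1] + sendCounts[r-1];
            rdispls[r] = rdispls[r-1] + recvCounts[r-1];
        }
        std::vector<unsigned char> sendFlat(sdispls[nTasks-1] + sendCounts[nTasks-1]);
        for (int r = 0; r < nTasks; r++)
            std::copy(sendBufs[r].begin(), sendBufs[r].end(),
                      sendFlat.begin() + sdispls[r]);

        // 3. Exchange the serialized meshes; the caller deserializes the
        //    per-rank slices using recvCounts and rdispls.
        std::vector<unsigned char> recvFlat(rdispls[nTasks-1] + recvCounts[nTasks-1]);
        MPI_Alltoallv(sendFlat.data(), sendCounts.data(), sdispls.data(), MPI_UNSIGNED_CHAR,
                      recvFlat.data(), recvCounts.data(), rdispls.data(), MPI_UNSIGNED_CHAR,
                      comm);
        return recvFlat;
    }

The exchange itself is already cheap (about 1.3 seconds above); the sketch is mainly to make clear what phases 2 through 4 are producing and consuming.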

The longest time is "Setting up meshes". Note that this routine leans on slow VTK routines such as "InsertNextCell", which grows its arrays with realloc (and, by the way, exposes an MVAPICH bug). Ideally, we would take one pass through the data to calculate sizes and then a second pass to generate the output efficiently.
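A sketch of that two-pass idea in VTK terms follows. DestinationForCell() is a hypothetical stand-in for "which spatial partition gets this cell?", and the rest is illustrative rather than the existing routine; the point is that the output is sized once up front, so InsertNextCell should not have to grow (and realloc) its arrays.

    // Sketch: size the output in a first pass, fill it in a second pass.
    #include <vtkIdList.h>
    #include <vtkPoints.h>
    #include <vtkPolyData.h>
    #include <vector>

    // Hypothetical helper: which spatial partition does cell c go to?
    extern int DestinationForCell(vtkPolyData *mesh, vtkIdType c);

    vtkPolyData *
    ExtractCellsForDestination(vtkPolyData *in, int dest)
    {
        vtkIdType nCells = in->GetNumberOfCells();
        vtkIdType nInPts = in->GetNumberOfPoints();
        vtkIdList *ids = vtkIdList::New();

        // Pass 1: count the cells going to 'dest' and the points they touch.
        std::vector<vtkIdType> ptMap(nInPts, -1);
        vtkIdType nOutCells = 0, nOutPts = 0;
        for (vtkIdType c = 0; c < nCells; c++)
        {
            if (DestinationForCell(in, c) != dest)
                continue;
            nOutCells++;
            in->GetCellPoints(c, ids);
            for (vtkIdType j = 0; j < ids->GetNumberOfIds(); j++)
                if (ptMap[ids->GetId(j)] < 0)
                    ptMap[ids->GetId(j)] = nOutPts++;
        }

        // Size the output once; the second pass should not reallocate.
        vtkPolyData *out = vtkPolyData::New();
        vtkPoints  *pts = vtkPoints::New();
        pts->SetNumberOfPoints(nOutPts);
        out->SetPoints(pts);
        pts->Delete();
        out->Allocate(nOutCells);

        // Pass 2: copy the points and connectivity into pre-sized arrays.
        double xyz[3];
        for (vtkIdType p = 0; p < nInPts; p++)
            if (ptMap[p] >= 0)
            {
                in->GetPoint(p, xyz);
                out->GetPoints()->SetPoint(ptMap[p], xyz);
            }
        std::vector<vtkIdType> conn;
        for (vtkIdType c = 0; c < nCells; c++)
        {
            if (DestinationForCell(in, c) != dest)
                continue;
            in->GetCellPoints(c, ids);
            conn.resize(ids->GetNumberOfIds());
            for (vtkIdType j = 0; j < ids->GetNumberOfIds(); j++)
                conn[j] = ptMap[ids->GetId(j)];
            out->InsertNextCell(in->GetCellType(c), (vtkIdType)conn.size(), &conn[0]);
        }
        ids->Delete();
        return out;
    }

With the arrays pre-sized, InsertNextCell should mostly avoid reallocation, sidestepping both the realloc cost and the MVAPICH issue noted above; the same counting pass could also feed the appender sizing in phase 2.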

I suspect efficiencies could be applied to the other phases as well.