Caching2.0

This page summarizes the caching discussion that took place on July 20th. Participants in the call were: Hank, Cyrus, Tom, Eric, and Gunther.

The motivation for this discussion was performance issues encountered by the Chombo group. We ultimately decided that it would be difficult to solve their issues.

We narrowed the discussion down to three flavors of caching:

  • Pipeline caching
  • Expression caching
  • Database caching

Pipeline caching was determined to be hard and not a good direction. We felt much of the upside would be possible with database sources, or possibly with a specialized Cache operator (more on these issues in the pipeline caching section). We all agreed that we would discourage anyone from taking this direction.

Expression caching was likely our most viable direction. We felt it would have a positive impact on the community and could be automated.

Database caching was discussed only to the extent that it would be nice to have controls to disable it, in the interest of reducing memory footprint. This is not as trivial as it sounds and is discussed in its own section.

Pipeline caching

Pipeline caching is quite hard for several reasons:

  • Whenever VisIt sets up a pipeline, it creates a unique ID for that pipeline. If operators get added or deleted, or if their order changes, then VisIt creates a new pipeline with a new unique ID. We think of this use case as sharing within a pipeline, but it is actually sharing across pipelines.
    • It would be possible to modify VisIt's network generation to be smarter, but that would be a lot of work.
  • Sharing across pipelines is a deceptively hard issue:
    • If you have a mesh plot and a pseudocolor plot, each with isosurface operators, and you execute the mesh plot's pipeline, then you can _not_ share the pipeline, because you would have to re-execute to pick up the scalar var for the pseudocolor plot.
    • If you have a pseudocolor plot and a mesh plot, each with isosurface operators, and you execute the pseudocolor plot's pipeline, then it would be difficult to determine whether you could share the pipeline, because you would have to compare the contracts and determine that the requirements in the mesh plot's contract are a subset of the requirements in the pseudocolor plot's contract.
    • If you have a single pseudocolor plot with an Isovolume operator followed by a Slice operator, and you are mucking with the slice, then you may or may not be able to reuse the input to the slice filter (and prevent re-execution of the Isovolume operator), because the slice filter may have restricted which domains are used. Worse, it may have communicated the slice plane to the reader, which honored the request (which means the contracts would be identical). That is solvable, but the contract would have to be correlated against the data attributes, which say whether or not the database applied an optimization.
  • We also discussed a Cache operator, which allows for explicit control.
    • This Cache operator works for the use case where a user has a complicated pipeline and wants to repeatedly modify the attributes of an operator towards the bottom (e.g. the last operator is a slice).
    • But if you are adding and removing operators, or even changing their order, then it will create a new pipeline ID and the Cache operator will be ineffective.
  • We felt that the issues here were complex and that we were very unlikely to pursue them, because they would take a large investment for a marginal gain.
  • We also felt that the "create new source" infrastructure would address the same problem, although not as elegantly.
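
The sharing checks described in the bullets above can be sketched as follows. This is a minimal illustration, not VisIt code: `Contract`, `CachedResult`, `can_reuse`, and the `db_applied_optimization` flag are hypothetical names standing in for the contract and data attributes mentioned in the discussion. The sketch is deliberately conservative and simply refuses to share whenever the database applied an optimization, rather than correlating the contract against the data attributes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contract:
    """Hypothetical stand-in for a pipeline contract: what a plot asks for."""
    variables: frozenset           # variables the pipeline must carry
    domains: frozenset = None      # None means "all domains"


@dataclass(frozen=True)
class CachedResult:
    """Hypothetical record of a pipeline that has already executed."""
    contract: Contract
    db_applied_optimization: bool  # e.g. the reader honored a slice-plane request


def can_reuse(cached: CachedResult, wanted: Contract) -> bool:
    # If the database tailored its output to the earlier request, the data
    # may be incomplete even when the two contracts compare equal.
    if cached.db_applied_optimization:
        return False
    # Every variable the new plot needs must already be in the cached output.
    if not wanted.variables <= cached.contract.variables:
        return False
    # A domain-restricted result cannot serve a request for more domains.
    if cached.contract.domains is not None:
        if wanted.domains is None or not wanted.domains <= cached.contract.domains:
            return False
    return True


# Assumed variable names, for illustration only.
mesh = Contract(variables=frozenset({"mesh"}))
pseudocolor = Contract(variables=frozenset({"mesh", "density"}))
```

With these definitions, `can_reuse(CachedResult(pseudocolor, False), mesh)` succeeds because the mesh plot's requirements are a subset, while `can_reuse(CachedResult(mesh, False), pseudocolor)` fails because the scalar variable was never read.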

Expression caching

Our notion of expression caching was to have the EEF (the expression evaluator filter) at the top of the pipeline return its computed expressions to the generic database, so that other pipelines could reuse those calculations.

Several issues were raised:

  • It would be difficult to do this with a DeferExpression operator, because we don't know what's happened before the DeferExpression executes. For this reason, we will disallow this case.
  • Expressions often change, and it is important that we don't cache the old values and then use them after they have gone stale. Hank initially wanted to suggest a string-based approach (foo="foo1+foo2", so cache "foo1+foo2"), but realized that nested expressions would cause problems (e.g. foo2 is also an expression and its definition changes). We decided that a revision number for the ExpressionList would address the problem, and that we could achieve a 90% solution by simply dumping all cached expressions whenever the ExpressionList changes. This isn't good for someone who is mucking with expressions, but, for the most part, expression definitions don't change. Note that this requires infrastructure where a change in the ExpressionList triggers a callback to the GenericDatabase (GenDB), which in turn clears out all cached expressions.
  • Cyrus raised the issue of memory footprint. This means cached expressions need to be cleared out when the time slice changes, etc., as we do with a normal variable.
  • When "cache Expressions" is turned off, we also need to clear the expressions out of the GenDB, which gives the user explicit control over cleaning out that memory.
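
The invalidation rules listed above can be sketched as a small cache. This is an illustrative sketch under assumptions, not the GenDB implementation: the class `ExpressionCache` and all of its method names are hypothetical, and the cache key (expression name plus domain) is an assumption made for the example.

```python
class ExpressionCache:
    """Hypothetical cache of expression results held by the database."""

    def __init__(self):
        self.enabled = True
        self._revision = 0       # ExpressionList revision the entries belong to
        self._time_slice = None
        self._entries = {}       # (expression name, domain) -> computed values

    def on_expression_list_changed(self, new_revision):
        # The "90% solution": any change to the list dumps every cached
        # expression, so a nested definition cannot leave a stale value behind.
        if new_revision != self._revision:
            self._revision = new_revision
            self._entries.clear()

    def set_time_slice(self, time_slice):
        # Treat expressions like normal variables: drop them when time changes.
        if time_slice != self._time_slice:
            self._time_slice = time_slice
            self._entries.clear()

    def set_enabled(self, enabled):
        # Turning caching off frees the memory right away.
        self.enabled = enabled
        if not enabled:
            self._entries.clear()

    def store(self, name, domain, values):
        if self.enabled:
            self._entries[(name, domain)] = values

    def lookup(self, name, domain):
        # Returns None on a miss, in which case the caller recomputes.
        return self._entries.get((name, domain))
```

Dumping everything on a revision change trades some recomputation for simplicity: no dependency tracking between nested expressions is needed, at the cost of cache misses for users who edit expressions frequently.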

Database caching

At first glance, this sounds like simply adding a button and not caching data if the button is disabled. However, Cyrus correctly pointed out that the cache is actually a temporary holding spot for data structures like avtMaterials and avtStructuredDomainBoundaries. Preventing these structures from being passed down the pipeline would cause serious problems, so any option to disable caching would still need to let these structures through. Hank commented that disabling only the caching of VTK objects would be a reasonable path.
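
One way to picture the path Hank suggested is a cache that treats bulk VTK data and auxiliary structures separately. The sketch below is only an illustration under assumptions: `DatabaseCache`, its methods, and the string keys are hypothetical and do not reflect how the real cache is organized.

```python
class DatabaseCache:
    """Hypothetical cache separating bulk VTK data from auxiliary structures."""

    def __init__(self, cache_vtk_objects=True):
        self.cache_vtk_objects = cache_vtk_objects
        self._vtk = {}        # bulk mesh and variable data, the main memory cost
        self._auxiliary = {}  # structures that later pipeline stages depend on

    def store_vtk(self, key, dataset):
        # Skipped entirely when the user has disabled VTK caching.
        if self.cache_vtk_objects:
            self._vtk[key] = dataset

    def store_auxiliary(self, key, structure):
        # Always held, regardless of the VTK setting, because the pipeline
        # still needs these structures passed down.
        self._auxiliary[key] = structure

    def set_cache_vtk_objects(self, enabled):
        self.cache_vtk_objects = enabled
        if not enabled:
            self._vtk.clear()  # release the memory already held

    def get_vtk(self, key):
        return self._vtk.get(key)

    def get_auxiliary(self, key):
        return self._auxiliary.get(key)
```

In this arrangement the user-facing control only affects the VTK side, so structures such as materials or domain boundary information remain available after caching is turned off.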