Database Format Detection

Database Format Detection Proposals

To-Do items orthogonal to the detection itself

  • automatic solution to problem of a plugin coring the mdserver
  • avoid ntimes*ndomains constructors when setting up database
  • disable database plugins without restarting visit (easy since we don't have to unload them)
  • enable database plugins without restarting visit (harder unless we currently load all the libI ones but don't use them)
  • Reopen fails when OpenAs with database options is used b/c we don't retain the options.
  • ...?

Proposal #1 : pure filename globs

  • Short description: a user-modifiable approach based purely on filename globs, with additional assumed, fallback, and unglobbable plugin lists.
  • Settings:
    • maintain a user-modifiable multi-map of filename glob -> plugin. E.g. "*.pdb" ==> {PDB, Silo} ; "OUTCAR*" ==> {VASP}. The plugins for each glob could be ordered.
    • have an ordered list of plugins to try if a file extension matches none of the known patterns. (This is because some formats don't have any extension or conventional filename.)
    • have an ordered list of plugins to try if the file extension does match a glob but it still fails to open. (equivalent to fallback format; the first case where users have a common set of codes they work with, but in this case they don't trust THOSE readers to be strict enough)
    • have an ordered list of plugins to try before even checking the file extension. (equivalent to assumed format; the other case where users have a common set of codes they work with, but in this case they don't trust the OTHER readers to be strict enough)
  • Process:
    • Try all the "assumed" plugins
    • then try any which match the glob (or the ones which don't have any filename matching if no glob matches)
    • and then try the fallback ones.
    • If none of these work, fail hard.
  • Pros:
    • Does about everything you could hope to do based purely on the filename.
  • Cons:
    • Too complex and confusing, relies too much on the user to configure it correctly.
    • Doesn't involve any ability for the plugin to make a claim on any file based on its contents, just its name.
    • This proposal doesn't yet accommodate directories.
    • The user won't know whether the e.g. PDB or Silo reader should come first; one is more strict than the other, so there's really an implicit ordering, and it's hard for the user to know what that should be.
  • Possible tweaks to this proposal:
    • A cleaner interface would be to just have the globs for each file format be editable by the user (in other words, map from Plugin to globs instead of glob to plugins), but that prevents us from ordering the plugins for any given glob.
    • Simplify the assumed/fallback format support. That's a big part of what really makes this too complex for users.
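The ordering described under "Process" can be sketched as follows. This is a minimal illustration in Python rather than VisIt code: the function name `candidate_order`, the list names, and the list contents are hypothetical, and the filename is assumed to be a base name without a directory part.

```python
from fnmatch import fnmatchcase

# Hypothetical settings; in the proposal all four would be user-editable.
GLOB_MAP = [                      # ordered multi-map: glob -> ordered plugins
    ("*.pdb", ["PDB", "Silo"]),
    ("OUTCAR*", ["VASP"]),
]
ASSUMED = ["Silo"]                # tried before the filename is looked at
NO_MATCH = ["PlainText"]          # tried only when no glob matches
FALLBACK = ["Image"]              # tried after everything else has failed


def candidate_order(filename):
    """Return plugin names in the order Proposal #1 would try them."""
    matched = []
    for pattern, plugins in GLOB_MAP:
        if fnmatchcase(filename, pattern):
            matched.extend(plugins)
    middle = matched if matched else NO_MATCH
    order = []
    for name in ASSUMED + middle + FALLBACK:
        if name not in order:     # keep only the first occurrence
            order.append(name)
    return order
```

If every plugin in the returned list fails to open the file, the caller fails hard, as in the last step of the process.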

Proposal #2 : Sean's

(note: copied verbatim by Jeremy from Sean's original email, might need some massaging/tweaks since we had discussion after it was sent)

The task is to determine what database plugin is appropriate for a given filename chosen by the user. The process starts by considering all plugins as candidates for the file, then progressively eliminates plugins from candidacy. At any point in the following sequence, break out if the set of candidate plugins has only one left.

  1. Check whether the filename points to a file or a directory. Exclude all plugins that do not handle the given type.
  2. Check the set of file extensions of each plugin. Exclude all plugins that do not handle the given type. (stopping here is what VisIt does currently)
  3. For the remaining plugins, hand them the filename and ask them for a 'maybe' or 'no' determination on just the characters of the filename. No disk I/O allowed here. Could do regex or other pattern matching. Any plugin saying 'no' is excluded.
  4. (directories). For directory plugins, hand them a list of files contained in the directory and ask them for a 'maybe' or 'no' determination on just the names of the files contained in the directory. No disk I/O allowed. Regex or other pattern matching. Any plugin saying 'no' is excluded.
  5. (files). For file plugins, read in the first N bytes from the file. For the remaining plugins, hand them these N bytes and ask for a 'maybe' or 'no' determination on those bytes. Any plugin saying 'no' is excluded. We might be able to do some optimizations for plugins that are NetCDF/HDF/whatever based. I could see doing a hash lookup for some of this.
  6. For the remaining plugins, hand them the filename and ask them for a 'maybe' or 'no' determination on file metadata. Any I/O is allowed, but we ask that no problem-sized data be loaded. One would hope that by phase 4, the set of plugins would be very small, thus impacting I/O very lightly.
  7. Take the remaining set of plugins and prompt the user for a choice. For command-line, fail.
  • Possible tweaks to this proposal:
    • Tom suggests that if at any point you get down to a single likely candidate, try it in more detail before proceeding.
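A rough sketch of the elimination loop, assuming each step can be written as a predicate that returns False for 'no'. The plugin table, the field names, and the `narrow` and `detect` functions are hypothetical, and only steps 1, 2, and 5 are modelled; the point is the early break once a single candidate is left.

```python
import os

# Hypothetical plugin descriptions. "magic" is a byte string expected in the
# first 8 bytes, or None when the plugin has no byte-level opinion.
PLUGINS = [
    {"name": "Silo",   "dirs": False, "exts": {".silo"}, "magic": None},
    {"name": "Pixie",  "dirs": False, "exts": {".h5"},   "magic": b"HDF"},
    {"name": "Chombo", "dirs": False, "exts": {".h5"},   "magic": b"HDF"},
    {"name": "S3D",    "dirs": True,  "exts": set(),     "magic": None},
]


def narrow(candidates, stages):
    """Apply stages in order; stop as soon as at most one candidate is left."""
    remaining = list(candidates)
    for stage in stages:
        if len(remaining) <= 1:
            break
        remaining = [p for p in remaining if stage(p)]
    return remaining


def detect(name, is_dir, head=b""):
    """`head` holds the first N bytes of the file, read once by the caller."""
    ext = os.path.splitext(name)[1]
    stages = [
        lambda p: p["dirs"] == is_dir,                            # step 1
        lambda p: is_dir or ext in p["exts"],                     # step 2
        lambda p: p["magic"] is None or p["magic"] in head[:8],   # step 5
    ]
    return [p["name"] for p in narrow(PLUGINS, stages)]
```

When more than one name comes back, step 7 applies: prompt the user in the GUI, or fail on the command line.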

Proposal #3 : Mark's (more of a requirements specification)

We have over 100 database plugins now and that number will only continue to grow. We rarely eliminate old plugins and we should probably think about developing a process for doing that (Dune and Vista are good examples as neither format is in use anymore, I think). As I see it, the goal is to determine which plugin should be used as quickly as possible and, as much as possible, without input from the user other than the file (or directory) s/he wants to open with VisIt. How much can we do with just the file (or directory) name? How much can we do by just reading and examining some bytes (say the first N) of the file? How much can we do by XXOpen'ing the file with the libXX I/O library and doing XX-specific I/O operations on the file? How expensive is each of these levels of inspection? How often is the answer as to which plugin to use still ambiguous after each level of inspection? How easy is it for a naive plugin developer to cause VisIt to behave badly for all users because of the way that a single plugin is coded? For example, we've had recent experiences where a segv in an HDF5 plugin basically had the effect of rendering all HDF5 plugins unusable WITHOUT user intervention because VisIt would re-try each HDF5 plugin, in order, eventually reaching the one that caused a segv and never reaching the correct one.

Without a doubt, an approach based on the filename (including its extension, if any) is going to perform the fastest. An approach requiring each plugin to attempt to open the file (often requiring the initialization of an underlying I/O library like HDF5) and examine its contents is going to be much more accurate but also more expensive, especially if each plugin is tested in this manner one after another. However, while filename extensions have been used successfully for the most part on Windows PCs for many years, for certain communities of VisIt users, that approach has proven to be useless. At the same time, while VisIt has over 100 database plugins, any given user is likely to use only a few of them in any one session or even over many months of use of VisIt. The list below identifies some of the requirements for automatic plugin detection...

  1. In general, the process of identifying a plugin for a given file (or directory) involves a tree-like decision hierarchy the upper levels of which are faster, but less accurate while the lower levels are slower but more accurate. We have identified three distinct types of interaction
    1. filename inspection where the string representing the filename (or directory and its contents) is examined using one or more of exact extension match, globbing and/or regex'ing.
    2. byte-level inspection where the file is treated as a stream of bytes and we open the file, read and examine some subset of these bytes.
    3. full inspection where the file is opened (by the plugin) using whatever I/O library (e.g. HDF5), if any, and examined using whatever means the plugin deems appropriate.
  2. Each level of inspection returns one of these answers
    1. 'no', the plugin cannot open the file
    2. 'not enough information' to tell whether the plugin can open the file or not
    3. 'yes', the plugin can open the file
    4. 'absolutely', the plugin can open the file and it is believed to be the ONLY plugin that can open the file
  3. The last answer is both advantageous and problematic. It's advantageous for cases like Silo where a '.silo' extension is unlikely to be used for non-Silo files. However, it's problematic if a VASP file just happens to have '.silo' as the last characters in its name, because it allows the Silo plugin to make an absolute decision that affects all plugins. What I think this means is that if a user wishes to, s/he can enable an 'absolutely' response (on a per-plugin basis) but that VisIt off-the-shelf should not.

There shall be an important distinction between the first two of these levels of inspection and the last. In the first two, the code that does the actual work of the inspection resides NOT in the plugin but within VisIt itself. For example, in filename inspection, VisIt is given exact extension match(es), glob(s) or regex(es) associated with a given plugin and VisIt performs the work of inspecting the filename. Likewise for byte-level inspection: VisIt is given a vector of offset(s) and length(s) of segments of bytes to read as well as magic numbers or sequences of bytes (perhaps also involving conditionals like unix /etc/magic), and VisIt performs the work of the byte-level inspection. In this way, VisIt is assured of having the control necessary to avoid issues with poorly coded plugins as much as possible (e.g. plugins that do problem-sized I/O in the wrong place). Only after filename inspection and/or byte-level inspection is control passed to a given plugin for full inspection.
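One way to picture host-side byte-level inspection is a table of (offset, expected bytes) rules per plugin that VisIt evaluates itself, without calling plugin code. The sketch below is an assumption about what such a table could look like, not an existing VisIt interface; `MAGIC_RULES`, `bytes_needed`, and `byte_level_pass` are hypothetical names, and the rules only encode the "HDF" and "CDF" prefixes mentioned elsewhere in these notes.

```python
import io

# Hypothetical rule table: plugin name -> list of (offset, expected bytes).
# A plugin survives if ANY of its rules matches.
MAGIC_RULES = {
    "NetCDF": [(0, b"CDF")],
    "Pixie":  [(1, b"HDF")],   # HDF5 files start with b"\x89HDF"
    "Chombo": [(1, b"HDF")],
}


def bytes_needed(rules, candidates):
    """Largest offset+length over the candidates' rules (0 if none)."""
    spans = [off + len(magic)
             for name in candidates
             for off, magic in rules.get(name, [])]
    return max(spans, default=0)


def byte_level_pass(stream, rules, candidates):
    """Read once, then filter candidates; no plugin code is executed."""
    head = stream.read(bytes_needed(rules, candidates))
    survivors = []
    for name in candidates:
        plugin_rules = rules.get(name)
        # No criteria defined: the plugin is assumed to answer 'maybe'.
        if plugin_rules is None or any(
                head[off:off + len(magic)] == magic
                for off, magic in plugin_rules):
            survivors.append(name)
    return survivors


# Usage with an in-memory stream standing in for an opened file.
sample = io.BytesIO(b"\x89HDF\r\n\x1a\n" + b"\0" * 8)
survivors = byte_level_pass(sample, MAGIC_RULES, ["NetCDF", "Pixie", "PlainText"])
```

Because the rules are plain data, they could live in a text file next to the plugin and be edited by users, in line with the requirements below.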

  1. If it is at all practically possible within VisIt's design and architecture, users should be given the ability to disable and enable database plugins without having to re-start VisIt. Why? As a first step in helping VisIt to make an automatic plugin determination quickly, users ought to be able to tell VisIt which plugins they NEVER expect to use by disabling them. At the same time, if they wind up ever needing them, they should NOT have to re-start VisIt to get them.
  2. For plugins for which it makes sense (e.g. for which it typically works in practice), it should be possible to associate exact extensions, glob(s) and/or regex(es) with a plugin for filename inspection. Furthermore, where it makes sense, this filename-based matching should be both defined by default for a plugin when VisIt is installed as well as programmatically alterable by a user while VisIt is running, including totally disabling any filename inspection for any given plugin. The string matching (again, exact extension, glob and/or regex) should be defined in such a way that VisIt need not dlopen() anything to obtain the matching criteria. Currently, VisIt must dlopen() XXXPluginInfo in order to obtain each plugin's extensions. In 2.0, it should be able to obtain this information from some human-readable text file (an xml file maybe). For example, the '.silo' extension is associated with the Silo plugin when VisIt comes 'off-the-shelf'. Nonetheless, a user can decide to add '.pdb' as an extension for the Silo plugin (perhaps now causing ambiguity with a PDB plugin, an example discussed below) so that the Silo plugin will be included among the candidates to open any file with '.pdb' as an extension. Or, a user can remove all string-based filename matching associated with the Silo plugin so that filename alone will NEVER result in exclusion (or inclusion) of the Silo plugin.
  3. For plugins for which it makes sense, it should be possible to associate byte-level inspection criteria (e.g. /etc/magic file-like stuff). Again, this shall be definable both by VisIt off-the-shelf as well as programmable by users and shall not require dlopen()'s of any code for VisIt to obtain.
  4. Plugin developers may (but are not required to) implement an 'IdentifyFormat()' method which VisIt may call (but is not required to call and will only call when conditions trigger it) where control is passed to a plugin to do whatever work (but presumably the smallest work necessary) it sees fit to render a yes/no/maybe decision on whether the plugin should be used to open the file.
  5. VisIt will remember and build up, over time, plugin statistics for each user over all plugins indicating which plugins a user has most frequently used over several windows of time (e.g. last day, last week, last month) to be used to order plugins for any of the three inspection processes. In order to gain information regarding obsolete-ness of a given plugin, we should maybe log plugin usage in VisIt installations so that VisIt developers can know when a given plugin is no longer needed (and who has been using it so that such persons can be contacted regarding obsoleting the plugin)
  6. Users may assign relative ranks to plugins, to be used in those cases where all levels of inspection have been exhausted without identifying a specific plugin, so that VisIt tries each remaining plugin in rank order. Note that the ability to totally disable plugins can be interpreted as a ranking process that always puts the disabled plugins into a 'never attempt' category.

The following is pseudo-code for a plugin identification algorithm using the procedure in Sean's proposal

  1. Assemble initial list of candidate plugins from the list of enabled plugins
  2. Sort the list according to frequency of use
  3. Iterate over current list of candidates applying filename inspection criteria but assume 'maybe' for any that do not define filename inspection criteria. If an 'absolutely' response is encountered, use that format. Otherwise, continue assembling list of candidates.
  4. After the above step is complete, list of candidates will have had removed from it any that responded 'no' but will include all that responded 'yes' and 'maybe' as well as any that were assumed 'maybe' because they didn't define filename inspection criteria. If this list is of size one, use that format.
  5. Obtain maximum length of bytes required over all candidates for byte-level inspection. Open and read this number of bytes from the file. Iterate over all plugins applying byte-level inspection criteria but assume 'maybe' for any that do not define byte-level inspection criteria. If an 'absolutely' response is encountered, use that format. Otherwise, continue assembling list of candidates.
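The pseudo-code above might look roughly like this in Python. The `Answer` enum, `run_pass`, and `identify` are hypothetical names; the checks are passed in as plain callables, and the first bytes of the file (`head`) are assumed to have been read already by the caller.

```python
from enum import Enum


class Answer(Enum):
    NO = 0
    MAYBE = 1          # 'not enough information'
    YES = 2
    ABSOLUTELY = 3


def run_pass(candidates, checks, target):
    """One inspection pass; `checks` maps plugin name -> callable(target)."""
    survivors = []
    for name in candidates:
        check = checks.get(name)
        # Plugins without criteria for this pass are assumed to say 'maybe'.
        answer = check(target) if check else Answer.MAYBE
        if answer is Answer.ABSOLUTELY:
            return [name]
        if answer is not Answer.NO:
            survivors.append(name)
    return survivors


def identify(enabled, use_counts, filename_checks, byte_checks, filename, head):
    # Steps 1-2: enabled plugins, most frequently used first (stable sort).
    candidates = sorted(enabled, key=lambda n: -use_counts.get(n, 0))
    # Steps 3-4: filename inspection.
    candidates = run_pass(candidates, filename_checks, filename)
    # Step 5: byte-level inspection, only if still ambiguous.
    if len(candidates) > 1:
        candidates = run_pass(candidates, byte_checks, head)
    return candidates
```

Whatever remains after both passes would go on to full inspection (IdentifyFormat() or an open attempt), which is not modelled here.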

There is a problem here. For any plugins that define only a full-inspection method but do not define filename and/or byte-level inspection, these plugins' IdentifyFormat() method will ALWAYS BE CALLED when VisIt attempts to open ANY file unless some other plugin returns an 'absolutely' response before getting to the full inspection pass. This will cause performance issues. The only remediation is either a) statistics are such that the most likely plugin to be needed is attempted first or b) the user has disabled such plugins (but s/he won't necessarily know which to disable to avoid this performance hit).

Proposal #4: Jeremy's overly complex probabilistic concrete variant of Sean's

  1. Check whether the filename points to a file or a directory. Exclude all plugins that do not handle the given type.
  2. Call a new "IsThisFilenameYours" function for each remaining plugin. Possible return values are "Likely", "Maybe", "Unlikely", and "No". (Yes, this is an awful name for the function.)
    • Examples:
      • PlainText returns "Maybe" for anything
      • VASP returns "Likely" for CHGCAR* OUTCAR* or POSCAR* and "Unlikely" otherwise
      • Image returns "Likely" for *.jpeg (etc.) and "Unlikely" otherwise
      • Enzo returns "Likely" for *.boundary or *.hierarchy and "No" otherwise
      • Silo returns "Likely" for *.silo, and "Maybe" otherwise (this is the most common pattern, I think)
  3. Step through the plugins in category order, starting with the "Likely" category, and call an "AreTheseBytesYours" function with the first (e.g.) 10kb of the file. Possible return values are "Likely", "Maybe", and "No". (Yes, this is an awful name for this function, too.)
    • Examples:
      • PlainText returns "Maybe" for anything ASCII, "No" otherwise
      • ProteinDatabank returns "Likely" if the first line(s) contain appropriate keywords, "No" otherwise
      • VASP returns "Likely" if it can find some keywords for OUTCAR files, or if it can find an integer and some floats at first for CHGCAR files, etc.
      • Pixie, Chombo, etc. all return "Maybe" if the first few bytes are "HDF"
      • NetCDF returns "Maybe" if the first few bytes are "CDF"
  4. For any that return "Likely", attempt a full Open. If succeeds, we're done.
  5. Repeat step 4 for any that returned "Maybe" from the byte check.
  6. User control overrides:
    • User can disable any plugin.
    • User can place a "preferred" mark on any plugin. For any of these, as long as it does not return "No" in either the filename check or the byte level check, it gets a full Open attempt before any others.
  7. Pros/Cons
    • Pro: determines rough ordering priority based only on the probabilities from these two functions. Is used to make sure the right plugins are tried first, reducing both extra byte-level checking, and reducing the chance that the wrong plugin will have a chance to open the file (and erroneously report success). On the other hand, wrong probability answers can subvert this.
    • Con: the complexities of having a plugin developer determine these likelihoods might be hard to justify.
  8. Variants:
    • Could collapse the filename/bytecheck into a single check that determines the order instead of doing a strict filename one first. The filename first is based somewhat on the assumption that a filename is a good (though not only) indicator of file type.
    • One could remove "Unlikely" in the filename matching and replace with "Maybe" (getting 3-state for filename) and collapse byte matches to a simple "yes"/"no" response, and still probably be as effective. We can contrive examples to test the effectiveness of these extra levels of probability.
    • One could make the filename check a simple glob/regex, plus a flag saying whether or not the check is strict. If it matches a regex, you get "Likely", if it doesn't, then you get either "Maybe" or "No" depending on the "strict" flag.
      • Silo returns "*.silo" and false (not strict)
      • Enzo returns "*.boundary" and "*.hierarchy" and true (strict)
      • PlainText returns no glob/regexes (NOT "*") and false (not strict)
      • Thinking out loud here, but with this variation, nothing should return "*" unless its byte check is *very* good.
      • Similarly, an empty list of glob/regexes and true is meaningless -- the format would never get called.
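The last variant (glob list plus a "strict" flag) is simple enough to sketch. The names `filename_likelihood`, `try_order`, and `SPECS` are hypothetical; the three plugin entries are the ones listed above.

```python
from fnmatch import fnmatchcase

# Plugin -> (globs, strict), taken from the examples in this variant.
SPECS = {
    "Silo":      (["*.silo"], False),
    "Enzo":      (["*.boundary", "*.hierarchy"], True),
    "PlainText": ([], False),
}
RANK = {"Likely": 0, "Maybe": 1}


def filename_likelihood(filename, globs, strict):
    """Match -> 'Likely'; no match -> 'No' if strict, else 'Maybe'."""
    if any(fnmatchcase(filename, g) for g in globs):
        return "Likely"
    return "No" if strict else "Maybe"


def try_order(filename, specs):
    """Drop 'No' answers, then put 'Likely' plugins ahead of 'Maybe' ones."""
    scored = [(name, filename_likelihood(filename, globs, strict))
              for name, (globs, strict) in specs.items()]
    kept = [(name, score) for name, score in scored if score != "No"]
    return [name for name, score in sorted(kept, key=lambda ns: RANK[ns[1]])]
```

With this shape, an empty glob list combined with strict=True always yields "No", which is the meaningless combination noted in the last bullet.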

Proposal #5: Jeremy's simpler concrete variant of Sean's

In short, the changes to the naive file opening scheme in 1.11 are the following:

  • change "extensions" to globs or regexes
    • MCM: regexes provide so much more power but are harder for ordinary users. But a regex would allow an Ale3d developer, for example, to say that files that start with letters, numbers and dashes, followed by underscore followed by 5 digits followed by underscore followed by 6 digits are silo files
  • let the user edit these globs/regexes
  • add a byte check function called right before ever actually opening (exception: "Open As" skips this check). This is passed some relatively small number of bytes at the beginning of the file and can indicate that the file is unreadable for the DB.
  • instead of an -assume-format flag, let the user specify several "preferred" formats which are tried first (if they pass the byte check)
    • MCM: There is probably an issue here with, for example, HDF5 plugins in that the byte-check for them is probably just going to look for 'HDF' in the first 3 bytes and so they won't disambiguate on that. Some HDF5 plugins, like Pixie, think they can read any HDF5 file and if Pixie happens to be tried ahead of something like SAMRAI and the input file is indeed SAMRAI, the wrong plugin will be used.
  • instead of a -fallback-format, all formats become fallback formats (again, if they pass the byte check)
  • (maybe) add a "strict" flag to the filename glob/extensions for cases like Enzo (might be unnecessary)
  • oh, and let plugins open directories instead of files. (The byte check here might be a directory listing, or something like that.)
    • MCM: Culling out 'directory' as a special case seems a bit contrived. I can think of other features of a string representing the name of an object on disk that we might want to use to help disambiguate plugins. A good example might be the filesystem the file lives in. I can imagine a plugin that uses FS-specific coding and can read files ONLY in a specific filesystem because of it. That would be natural especially for many of these high end parallel filesystems that seem to vary from platform to platform. Another feature might be the actual size of the file. I mention these only to suggest that I think this notion of directory could be generalized to represent multiple different features of the file of which directory is just one example. That could affect the interface to represent this information.
      • Jeremy says: I believe there is some mis-communication here. I'm not talking about features of a file, of which "directory it lives in" is one. I'm talking about opening a whole directory INSTEAD of an individual file. You can't open a "10 megabytes" or a "lustre" -- those statements are nonsense. In contrast, saying you want to open "path/to/s3d/files/" makes perfectly good sense. Therefore, size or filesystem are not equivalent concepts to directory, at least not in the way I mean it. The desire to open a directory instead of a file is simply not contrived: it occurs in both the Enzo and S3D readers, and probably others.
      • Sean: Jeremy is correct here. I am talking about opening a directory INSTEAD of a plain file. That's a different thing than learning what directory a plain file lives in.
      • Sean: With that said, I could see having information about the file (size, file system, Lustre striping, etc.) being useful information to a plugin. Not sure how to generalize this information, or why generalizing it would be better than letting the plugin find out the information on its own. But I could see that maybe being useful someday. Not for 2.0.
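As an illustration of the regex case MCM describes in the first bullet (letters, numbers and dashes, then an underscore, 5 digits, another underscore, and 6 digits), a pattern could look like the one below. It assumes "letters" means ASCII letters and that the whole filename must match; the function name and the sample filenames are made up.

```python
import re

# Assumption: ASCII letters only, and the pattern must cover the whole name.
ALE3D_LIKE = re.compile(r"[A-Za-z0-9-]+_[0-9]{5}_[0-9]{6}")


def looks_like_ale3d_silo(filename):
    """True if the filename has the <name>_<5 digits>_<6 digits> shape."""
    return ALE3D_LIKE.fullmatch(filename) is not None
```

For instance, a hypothetical name like `shock-tube2_00010_000123` matches, while `run_0010_000123` does not because the first numeric field has only four digits.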

Specifically, the steps for opening files are:

  1. Check whether the filename points to a file or a directory. Exclude all plugins that do not handle the given type.
    If the user says "Open As", try a full open on the file.
    • MCM: The OpenAs dialog should place the 'preferred' plugins, if any, at the top of the pull down list. Can the OpenAs dialog use filename match to order this list? What about byte-match?
  2. First, walk through the remaining plugins marked as "Preferred" by the user and do a byte check on the first 10kb of the file.
    • If any pass this byte check, try a full open.
    • If any full open succeeds, stop.
  3. Ask for a regex/glob for the remaining plugins (not "preferred"), and a boolean indicating whether it's strict.
    • Examples:
      • PlainText returns no regexes (meaning anything matches) and "False" (not strict)
      • VASP returns CHGCAR* OUTCAR* and POSCAR* and "False" (not strict)
      • Enzo returns *.boundary and *.hierarchy and "True" (strict)
  4. For ones that return a regex/glob filename match, do a byte check on whether the plugin can open the file or not.
    • If any pass the byte check, try a full open.
    • If any full open succeeds, stop.
  5. So now go ahead and fall back to doing a byte check on ones that didn't pass the regex/glob (unless they're strict)
    • If any pass this byte check, try a full open.
    • If any full open succeeds, stop.
    • MCM: We could opt not to stop at the first successful instantiation (e.g. SetupDatabase) and instead continue through all that are still candidates and see which, if any others, also successfully instantiate. Then, we can ask the user to select from among these or grab metadata from each by calling PopulateDatabaseMetadata and select the one with the most metadata or as Jeremy suggested, remove any that do not have any meshes or curves and upon removing those, if the list of candidates is still larger than one, ask the user. In the case that we do eventually get to the point where we ask for user input to disambiguate, then in such a dialog, we also ought to point the user towards the related dialogs that would allow them to set preferences, disable plugins, etc, such that they can avoid having to help VisIt with this kind of file in the future.
  6. After all possibilities have been exhausted, ask the user by bringing up OpenAs dialog
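Steps 2 through 6 can be sketched as an ordering function plus a loop of full-open attempts. This is only an outline under several assumptions: plugins are plain dictionaries built by a hypothetical `plugin` helper, byte checks are callables on the first bytes of the file, an empty glob list means "anything matches" (as in the PlainText example), and `full_open` stands in for the real open attempt. The file-versus-directory filter and "Open As" are left out.

```python
from fnmatch import fnmatchcase


def plugin(name, globs, strict, byte_check, preferred=False):
    return {"name": name, "globs": globs, "strict": strict,
            "byte_check": byte_check, "preferred": preferred}


def open_order(plugins, filename, head):
    """Names in the order full opens would be attempted."""
    def matches(p):
        return not p["globs"] or any(fnmatchcase(filename, g) for g in p["globs"])

    preferred = [p for p in plugins if p["preferred"]]                   # step 2
    others = [p for p in plugins if not p["preferred"]]
    by_glob = [p for p in others if matches(p)]                          # steps 3-4
    fallback = [p for p in others if not matches(p) and not p["strict"]]  # step 5
    return [p["name"] for p in preferred + by_glob + fallback
            if p["byte_check"](head)]


def detect(plugins, filename, head, full_open):
    """First plugin whose full open succeeds, else None (step 6: ask the user)."""
    for name in open_order(plugins, filename, head):
        if full_open(name):
            return name
    return None


PLUGINS = [
    plugin("Silo", ["*.silo"], False, lambda h: b"HDF" in h[:8] or b"PDB" in h[:8]),
    plugin("Enzo", ["*.boundary", "*.hierarchy"], True, lambda h: True),
    plugin("VASP", ["CHGCAR*", "OUTCAR*", "POSCAR*"], False, lambda h: h.isascii()),
    plugin("PlainText", [], False, lambda h: h.isascii()),
]
```

Marking a plugin as preferred only moves it to the front; it still has to pass its byte check before a full open is attempted.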

User control overrides:

  • User can disable any plugin.
  • User can place a preferred mark on any plugin. For any of these, as long as it does not return "No" in either the filename check or the byte level check, it gets a full Open attempt before any others. One might make this preferred list sortable as well.
  • One can allow users to edit filename globs (not sure what to do about "strict" ones). The advantage is that it can be more specific; it allows things other than the equivalent of a "*" glob that the "preferred" flag does. However, it does not bump these up in priority.
  • Doing both the glob and preferred-mark user overrides is more work, but probably the best idea.

Pros/Cons

  • Pro: tries ones that match the glob first, but can still fall back on other ones provided they pass a byte check.
  • Pro: avoids byte checking until it either passes a filename match, or we're forced into more detailed inspection
  • Pro: minimal effort, I think: change extensions to globs, add a flag, add a byte check, add one "Preferred" user boolean.
  • Con: for e.g. HDF5 files, if the filename doesn't rule out any plugins, we have to do an HDF5 open on it multiple times; doesn't mean there's not a workaround here, but might not be worth it -- we'd have to let multiple plugins share an HDF5 plugin. Using preferred reader flags, and editable globs, is probably a more effective solution.
  • Con: with only a single level of byte check probability, how do we prevent plugins like PlainText, which are overly lenient, from opening other ASCII files without known extensions? (possible ideas: (a) have no byte check for that plugin, (b) have some other flag like "I'm a lenient plugin", (c) or, of course, have two levels of "yes")

Notes:

  • The whole "strict" flag may be unnecessary. By the time we get to a full open, this should be effective at failing and thus falling back to other readers.
  • MCM: SetupDatabase is going to have to be re-worked as part of this effort: there are issues with it instantiating many format objects and unnecessary exposure to risky code (code in a plugin that may not be too nice about how it lives with other plugins). Just reading through Stroustrup's C++ Programming Language, there is a good section, 14.4, in there on "resource acquisition is initialization" and the use of exceptions to return from constructors which is currently the only way SetupDatabase gets informed that an attempted plugin is not appropriate. I think using Exceptions to abort failed construction is probably fine. However, I think we need to audit existing plugins to ensure they a) do indeed do some work to confirm they are in fact able to read the given file, b) throw an exception if not and c) free up any resources they might be grabbing in the constructor. Many of the simpler plugins do NOT have any exception in the construction process.
  • MCM: We need to deal with the case where a plugin cores the mdserver during this process. Currently, the mdserver will restart and re-core with the same bad plugin even if that plugin is the wrong one to be using for the given input file. Jeremy suggested sending information to the viewer just prior to an attempt to instantiate via SetupDatabase so that if the mdserver cores as a result and restarts, the viewer can remind it not to try that plugin again. I suggested the mdserver create a small file in the current working directory with the name of the plugin it is about to try and remove that file if it succeeds. However, if it winds up coring and restarting, then it would find that file and decide NOT to try that plugin. The viewer approach is better but requires interaction with the viewer during the open bootstrap process on a per-plugin basis. That sounds potentially bad. In client server, the mdserver will have the vcl process to talk with so maybe that is a better choice in that mode. I wonder if we might be able to store some information to the environment from the parent process of the mdserver? I guess that might be the same as rpc'ing that info around. I honestly don't know.
  • MCM: So, I am really, really defensive about getting a bad plugin in the mix that for some reason takes a really long time and/or uses a lot of memory to do its byte-check and/or instantiation attempt. If I could do it, I'd probably set some sig-alarms for a couple of hundred milliseconds in the future such that if the plugin hasn't returned from the call by then, we bail out of it with a 'too slow to be useful' exception. I know that's overkill but I just like the idea of being able to confidently prevent one bad apple from ever spoiling the bunch.
  • MCM: Cyrus mentioned an important use case that we probably don't test enough, and that is a corrupted input file. An example might be, say, corrupted XDMF input. The XDMF reader's filename and byte checks DO NOT eliminate it, and let's suppose the 'I can read all things ascii' plugin comes just after it. The XDMF plugin will either a) exception out during instantiation, b) populate database metadata correctly or c) fail to read a particular variable from the file. If it fails to instantiate, then the (fictitious) ascii plugin will instantiate but return no metadata in the populate database metadata method. So, the user will be faced with what appears to be a working VisIt but unable to plot anything with no explanation as to why. Cyrus suggested that perhaps in the case of no metadata we could use that as an indication of either a) error or b) warn the user in the GUI. However, this is predicated on the notion that a plugin will produce metadata only when there is indeed something for the plugin to read. Several plugins, like Image, summarily populate md without ever really confirming they can read their input. That is a problem that probably needs to be corrected in all plugins where it currently exists, as part of this work.
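The marker-file idea from the mdserver note above could work roughly as follows. This is a sketch of the concept only: the marker file name, the function names, and the `attempt` callable are hypothetical, and the marker is removed whenever the attempt returns or raises, since only a hard crash should leave it behind.

```python
import os


def marker_path(workdir):
    return os.path.join(workdir, ".plugin_attempt")   # hypothetical file name


def plugins_to_skip(workdir):
    """A leftover marker means the previous process died mid-attempt."""
    try:
        with open(marker_path(workdir)) as f:
            return {f.read().strip()}
    except FileNotFoundError:
        return set()


def guarded_attempt(workdir, name, attempt):
    """Record the plugin name, run the attempt, clear the record if we survive."""
    with open(marker_path(workdir), "w") as f:
        f.write(name)
        f.flush()
        os.fsync(f.fileno())          # make sure the marker is on disk first
    try:
        return attempt()
    finally:
        # Runs on normal return and on exceptions, but not if the process cores.
        os.remove(marker_path(workdir))
```

A restarted process would call `plugins_to_skip` before looping over candidates. A single marker only remembers the most recent crash, so a real implementation would likely need to accumulate names.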

Proposal #600: Cyrus' repackaging of everyone else's ideas

(note: this is basically just a snip from an email I sent to the developers list)

Start with an ordered list of plugins with their globs.

The user can modify the list order, the globs, and enable or disable plugins.

Pass 0:

  • Narrow our search based on databases that expect a file vs directory.

Pass 1:

  • Check glob(s); if this check fails, opt out.
  • Check bytelevel criteria; if this check fails, opt out.

(opt_out = !(glob & bytelevel))

In the case of a directory db the "bytelevel" check would instead just look for key files in the directory.

Pass 2:

  • For remaining plugins try full open, if we succeed use this plugin.
  • Tell the user you "successfully" opened their database with the plugin in some non-annoying way.
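A compact sketch of the three passes, assuming the user-ordered list is a list of dictionaries and that `full_open` stands in for the real open attempt. The field names and `first_plugin` are hypothetical; the `opt_out` expression mirrors the formula given for Pass 1.

```python
from fnmatch import fnmatchcase


def first_plugin(ordered_plugins, filename, is_dir, head, full_open):
    """Walk the user-ordered list and return the first plugin that opens."""
    for p in ordered_plugins:
        if not p["enabled"] or p["dirs"] != is_dir:        # Pass 0
            continue
        glob_ok = any(fnmatchcase(filename, g) for g in p["globs"])
        bytes_ok = glob_ok and p["byte_check"](head)       # skipped if glob fails
        opt_out = not (glob_ok and bytes_ok)               # Pass 1
        if opt_out:
            continue
        if full_open(p["name"]):                           # Pass 2
            return p["name"]
    return None


# Hypothetical ordered list; the user may reorder, edit globs, or disable.
ORDERED = [
    {"name": "Silo", "enabled": True, "dirs": False, "globs": ["*.silo"],
     "byte_check": lambda h: b"HDF" in h[:8] or b"PDB" in h[:8]},
    {"name": "PlainText", "enabled": True, "dirs": False, "globs": ["*"],
     "byte_check": lambda h: h.isascii()},
]
```

For a directory database, `head` would be replaced by a directory listing and the byte check by a search for key files, as noted above.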


Pros:

  • Ultimately the order of the list is king, but that's good b/c it's the *easiest* thing that can be changed by the user.
  • In this case both the globs and the bytelevel checks can be very permissive, but they still provide a useful way to narrow down the choices before a full open.
  • Allows the user to easily fix things when a false positive opening scenario occurs.
  • Allows the user to control the order full opens occur, which can marginally help performance.
  • I think it is the simplest conceptually & to implement. (Of course take this with a grain of salt)


Cons:

  • Producing an initial list may be a concern for some parties, but I would argue even starting with a random list would still ultimately give the user more control and be preferable to denying them influence over precedence.
  • May require user tweaking for odd cases.
  • (Please add any concerns....)

Creating the initial list seems to be a deal breaker for too many parties.

I think Jeremy's scheme #5 provides the amount of control we need to resolve problems if these two things are included:

  • A user controllable ordering for preferred plugins.
  • A user option for using a failed glob match to exclude a plugin. (the 'strict' glob option)

Example Problems

Setup

file format       | glob pattern | byte-level check returns "Maybe" when: (and "No" otherwise)
------------------|--------------|-------------------------------------------------------------
PlainText         | *            | anything ASCII
LAMMPS            | *            | looks for known keywords in first line(s)
Silo              | *.silo       | anything starting with HDF or PDB in the first few bytes
PDB (PACT)        | *.pdb        | anything starting with PDB in the first few bytes
Protein Data Bank | *.pdb        | looks for known keywords in first line(s)

Problems

  • file ending in *.pdb that's a protein data bank file
  • file with no extension that's a protein data bank file
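To make the two problems concrete, the setup table can be encoded as data and run against a sample file header. Everything specific here is an assumption for illustration: the keyword sets (`HEADER`, `ATOM`, etc. for Protein Data Bank, `ITEM:` for LAMMPS), the 8-byte window for "first few bytes", the sample filenames, and the `survivors` function with its `strict_globs` switch.

```python
from fnmatch import fnmatchcase

PROTEIN_KEYWORDS = {b"HEADER", b"ATOM", b"HETATM", b"REMARK"}   # illustrative


def first_word(head):
    words = head.split()
    return words[0] if words else b""


# (format, glob, byte-level check returning True for "Maybe")
FORMATS = [
    ("PlainText", "*", lambda h: h.isascii()),
    ("LAMMPS", "*", lambda h: h.startswith(b"ITEM:")),
    ("Silo", "*.silo", lambda h: b"HDF" in h[:8] or b"PDB" in h[:8]),
    ("PDB (PACT)", "*.pdb", lambda h: b"PDB" in h[:8]),
    ("Protein Data Bank", "*.pdb", lambda h: first_word(h) in PROTEIN_KEYWORDS),
]


def survivors(filename, head, strict_globs=True):
    """Formats left after the glob check (if strict) and the byte check."""
    return [name for name, glob, byte_check in FORMATS
            if (fnmatchcase(filename, glob) or not strict_globs)
            and byte_check(head)]


protein_head = b"HEADER    HYDROLASE\n"
with_ext = survivors("1abc.pdb", protein_head)
no_ext_strict = survivors("1abc", protein_head)
no_ext_lenient = survivors("1abc", protein_head, strict_globs=False)
```

Under these assumptions, the file with a `.pdb` extension leaves PlainText and Protein Data Bank as candidates, so ordering decides the outcome. The file with no extension leaves only PlainText when a failed glob excludes a plugin, and recovers Protein Data Bank only when glob failures are not treated as exclusions.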

(more to come)

Other Suggestions

(Jeremy, from a conversation with Allen on Dec 29 2009. Some of these ideas may already be mentioned above.):

  • Show the plugin used to read a file more prominently, e.g. in the "Source" dropdown, or selected files list, or something like that. Maybe a status message.
  • Whether as an option or always, if more than one file extension matches, present a list of choices to the user? (Maybe only if no preferred reader matches.)
  • The downside of the previous one is the non-interactive or command-line implementation. Instead, maybe a *warning* if more than one matches? This works in any mode.
  • Also, one path may be to *try* more than one plugin if more than one matches the filename. That way we can restrict the warning / choice scenario (from the previous 2 bullets) to the case where more than one plugin can potentially work.