From: | Felix Höfling |
Subject: | Re: [h5md-user] Unit attribute versus non-dimensionless quantities |
Date: | Wed, 31 Jul 2013 10:24:54 +0200 |
User-agent: | Opera Mail/12.15 (Linux) |
Hi all,
Upon reading the section on Time-independent data, I realized that the specification is inconsistent with regard to the unit attribute versus time-independent box attributes, or more generally, the unit attribute versus time-independent non-dimensionless data stored in an attribute.
The issue is that HDF5 attributes cannot have additional metadata, i.e., attributes cannot be attached to an attribute. If the length is chosen as a non-dimensionless quantity, then the box edges and offset require a unit attribute. However, if the box is fixed in time, the box edges and offset are stored as attributes, and thus cannot have a unit attribute.
A solution would be to store the box edges or offset as either a data group (time-dependent), a dataset (time-independent with unit), or an attribute (time-independent without unit). Further, the specification could state that a dimensionless time-independent quantity is stored as either an attribute or a dataset (depending on the size of the data), while a non-dimensionless time-independent quantity is stored as a dataset with a unit attribute.
An alternative would be to store the box edges and offset as either a data group or a dataset, and generally forbid attributes for storing quantities which could potentially carry a unit. However, I would rather avoid this restriction, since it causes a mess of attribute and dataset quantities in my output files, despite all of them being dimensionless time-independent quantities.
Another approach would be to dispose of the unit attribute, and store a non-dimensionless quantity as a compound of a number (the value) and a string (the unit). This would allow storing quantities with a unit in both datasets and attributes, the latter being especially useful for any kind of parameter. Since I have no experience with compound data types, e.g., how they would affect the size and compression of a time series, or how usable they are across programming environments, this solution would have to be postponed until after H5MD v1.0.
How would you resolve the issue? I feel that the safest, future-proof solution is to move the unit attribute from the specification to the discussion section, and find a solution later that has proven itself in practice. I have yet to hear from someone who is consistently using H5MD data with units in their simulations, never mind experiments.
Peter
Hi Peter,
This is indeed a problem, very carefully observed! I have somewhat mixed feelings about completely ignoring the possibility of specifying the physical unit of the data. This is a particular strength of HDF5 and, albeit almost trivial, a clear improvement over commonly used file formats. (E.g., VMD expects the input to be in ångströms if I'm not mistaken.)
The combination "dataset plus unit attribute" seems to be precisely in the spirit of HDF5 to annotate data with attributes. From this point of view, I would suggest that dimensionful data are stored as HDF5 datasets. On the other hand, favouring attributes over datasets for small data makes sense as discussed extensively.
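
As a minimal h5py sketch of the two storage choices mentioned here: a dimensionful quantity goes into a dataset that carries a "unit" attribute, while a small dimensionless quantity stays a plain attribute. The group name "observables", the quantity names and the values are assumptions chosen for illustration only, not names from the H5MD specification.

```python
import h5py
import numpy as np

# An in-memory file (core driver, no backing store) so the sketch writes nothing to disk.
with h5py.File("sketch_units.h5", "w", driver="core", backing_store=False) as f:
    obs = f.create_group("observables")  # hypothetical group name

    # Dimensionful, time-independent quantity: a scalar dataset plus a "unit" attribute.
    temperature = obs.create_dataset("temperature", data=np.float64(300.0))
    temperature.attrs["unit"] = "K"

    # Small dimensionless quantity: a plain attribute is sufficient.
    obs.attrs["dimension"] = 3

    # Read everything back while the file is still open.
    stored_value = temperature[()]
    stored_unit = temperature.attrs["unit"]
    stored_dimension = obs.attrs["dimension"]
```

The unit string travels with the dataset it annotates, so a reader does not need to know anything about the rest of the file.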
The trouble starts when the attribute itself is dimensionful. The example in the HDF5 manual
http://www.hdfgroup.org/HDF5/doc/UG/13_Attributes.html
skips this issue and simply assumes that temperature is in degrees centigrade and pressure in atmospheres!? The problem has been recognised, see e.g.,
http://lists.hdfgroup.org/pipermail/hdf-forum_lists.hdfgroup.org/2009-March/000439.html
but until now there appears to be no nice solution. The solution with compound types is quite cumbersome; I would like to see h5py code implementing this (the h5py manual says almost nothing about compounds):
http://hdf-forum.184993.n3.nabble.com/attribute-units-td1526251.html
Compound types are not easy to use (and maybe not fully supported by all top-level APIs). They may prevent users from using such attributes at all, and eventually, people will continue to assume rather than to specify the unit of a dataset. In my opinion, compound attributes are clearly not a solution.
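
For reference, a rough sketch of what the compound approach could look like in h5py, using a NumPy structured dtype with a "value" field and a fixed-length "unit" string. The field names, the 16-byte string length and the attribute name "edge_length" are assumptions made up for this illustration; nothing here is prescribed by H5MD.

```python
import h5py
import numpy as np

# Hypothetical compound type: a number together with a fixed-length unit string.
quantity = np.dtype([("value", np.float64), ("unit", "S16")])

# In-memory file so the sketch writes nothing to disk.
with h5py.File("sketch_compound.h5", "w", driver="core", backing_store=False) as f:
    box = f.create_group("box")

    # Write a scalar attribute of the compound type.
    box.attrs.create("edge_length", np.array((10.0, b"nm"), dtype=quantity))

    # Reading requires unpacking the record and decoding the byte string.
    stored = box.attrs["edge_length"]
    edge_value = float(stored["value"])
    edge_unit = stored["unit"].item().decode("ascii")
```

Even in this small case the reader has to know the field names and decode the unit by hand, which illustrates the usability concern raised above.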
A global solution would be to add unit attributes to the h5md group, one for each of the 7 base dimensions: length, time, mass, temperature, electric current, amount of substance, luminous intensity:
http://en.wikipedia.org/wiki/Si_units#Base_units
But this seems to be very restrictive and it requires some physics knowledge to reconstruct the derived units. Further, it would break modularity as datasets would not be completely independent of each other.
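
To illustrate the reconstruction step, here is a small pure-Python sketch. It assumes the base units were stored once per file and that the reader supplies the physical dimension of every quantity; the dictionaries, the unit symbols and the helper derived_unit are all hypothetical.

```python
# Hypothetical file-wide base units, as they might be stored on the h5md group.
base_units = {"length": "nm", "time": "ps", "mass": "u"}

# The reader must know the dimension of each quantity as exponents of the base dimensions.
dimensions = {
    "velocity": {"length": 1, "time": -1},
    "pressure": {"mass": 1, "length": -1, "time": -2},
}

def derived_unit(quantity):
    """Compose the unit string of a derived quantity from the base units."""
    parts = []
    for dimension, exponent in dimensions[quantity].items():
        symbol = base_units[dimension]
        # Omit the exponent when it is 1, otherwise append it to the symbol.
        parts.append(symbol if exponent == 1 else f"{symbol}{exponent}")
    return " ".join(parts)

velocity_unit = derived_unit("velocity")
pressure_unit = derived_unit("pressure")
```

The table of dimensions is exactly the physics knowledge that each reader would have to bring along, and every dataset then depends on the global unit attributes.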
In conclusion, I lean towards Peter's solution #1: we leave the optional unit attribute as it is, but state in the general section that dimensionful, time-independent data have to be stored as datasets _if_ the unit attribute is needed. In order to avoid too many possibilities for the box, we may attach the "unit" attribute to the "box" group itself rather than to edges/offset.
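
A possible reading of the box proposal, sketched with h5py for the time-independent case: edges and offset remain attributes of the "box" group, and a single "unit" attribute on the same group applies to both. The numerical values and the unit "nm" are made up for the example, and the layout is an assumption about how the proposal could be realised, not settled specification text.

```python
import h5py
import numpy as np

# In-memory file so the sketch writes nothing to disk.
with h5py.File("sketch_box.h5", "w", driver="core", backing_store=False) as f:
    box = f.create_group("box")

    # Time-independent box: edges and offset stay attributes of the group.
    box.attrs["edges"] = np.array([10.0, 10.0, 20.0])
    box.attrs["offset"] = np.zeros(3)

    # One unit attribute on the group itself covers both quantities.
    box.attrs["unit"] = "nm"

    # Read everything back while the file is still open.
    box_unit = box.attrs["unit"]
    edges = box.attrs["edges"]
    offset = box.attrs["offset"]
```

Because the unit sits on the group, the same attribute would still apply if edges or offset were later stored as datasets or time-dependent data groups inside "box".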
Regards, Felix