h5md-user

Re: [h5md-user] Unit attribute versus non-dimensionless quantities


From: Felix Höfling
Subject: Re: [h5md-user] Unit attribute versus non-dimensionless quantities
Date: Wed, 31 Jul 2013 10:24:54 +0200
User-agent: Opera Mail/12.15 (Linux)

On 31.07.2013 at 04:39, Peter Colberg <address@hidden> wrote:

Hi all,

Upon reading the section on Time-independent data, I realized that the
specification is inconsistent with regard to the unit attribute versus
time-independent box attributes, or more generally, the unit attribute
versus time-independent non-dimensionless data stored in an attribute.

The issue is that HDF5 attributes cannot have additional metadata,
i.e., attributes cannot be attached to an attribute. If the length is
chosen as a non-dimensionless quantity, then the box edges and offset
require a unit attribute. However, if the box is fixed in time, the
box edges and offset are stored as attributes, and thus cannot have
a unit attribute.

A solution would be to store the box edges or offset as either a data
group (time-dependent), a dataset (time-independent with unit), or an
attribute (time-independent without unit). Further, the specification
could state that a dimensionless time-independent quantity is stored
as either an attribute or a dataset (depending on the size of the
data), while a non-dimensionless time-independent quantity is stored
as a dataset with a unit attribute.
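
The distinction proposed above can be sketched in h5py as follows. This is only an illustration under stated assumptions: the group path "particles/all/box", the file name, and the unit string "nm" are hypothetical and not taken from the specification.

```python
import h5py
import numpy as np

# In-memory file: nothing is written to disk. The file name is arbitrary.
with h5py.File("box_example.h5", "w", driver="core", backing_store=False) as f:
    box = f.create_group("particles/all/box")  # hypothetical group path

    # Dimensionless, time-independent quantity: a plain attribute suffices.
    box.attrs["dimension"] = 3

    # Dimensionful, time-independent quantity: stored as a dataset,
    # because only objects (not attributes) can carry a "unit" attribute.
    edges = box.create_dataset("edges", data=np.array([10.0, 10.0, 20.0]))
    edges.attrs["unit"] = "nm"  # assumed unit string

    # Read everything back while the file is still open.
    dimension = int(box.attrs["dimension"])
    edge_values = edges[...]
    edge_unit = edges.attrs["unit"]
```

A reader then has to check whether "edges" is an attribute or a dataset before it can look for a unit, which is the extra case distinction this option introduces.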

An alternative would be to store the box edges and offset as either
a data group or a dataset, and generally forbid attributes for storing
quantities which could potentially carry a unit. However, I would
rather avoid this restriction, since it causes a mess of attribute
and dataset quantities in my output files, despite all of them being
dimensionless time-independent quantities.

Another approach would be to dispose of the unit attribute, and store
a non-dimensionless quantity as a compound of a number (the value) and
a string (the unit). This would allow storing quantities with a unit
in both datasets and attributes, the latter being especially useful
for any kind of parameter. Since I have no experience with compound
data types, e.g., how they would affect the size and compression of a
time series, or how usable they are across programming environments,
this solution would have to be postponed until after H5MD v1.0.

How would you resolve the issue?

I feel that the safest, future-proof solution is to move the unit
attribute from the specification to the discussion section, and find
a solution later that has proven itself in practice. I have yet to
hear from someone who is consistently using H5MD data with units in
their simulations, never mind experiments.

Peter


Hi Peter,

This is indeed a problem, very carefully observed!

I have somewhat mixed feelings about completely ignoring the possibility of specifying the physical unit of the data. Annotating data with units is a particular strength of HDF5 and, albeit almost trivial, a clear improvement over commonly used file formats. (E.g., VMD expects the input to be in ångströms, if I'm not mistaken.)

The combination "dataset plus unit attribute" seems to be precisely in the spirit of HDF5, which is to annotate data with attributes. From this point of view, I would suggest that dimensionful data be stored as HDF5 datasets. On the other hand, favouring attributes over datasets for small data makes sense, as discussed extensively.

The trouble starts when the attribute itself is dimensionful. The example in the HDF5 manual
http://www.hdfgroup.org/HDF5/doc/UG/13_Attributes.html
skips this issue and simply assumes that temperature is in degrees Celsius and pressure in atmospheres!? The problem has been recognised, see e.g.,
http://lists.hdfgroup.org/pipermail/hdf-forum_lists.hdfgroup.org/2009-March/000439.html,
but until now there appears to be no nice solution.

The solution with compound types is quite cumbersome; I would like to see h5py code implementing this (the h5py manual says almost nothing about compounds):
http://hdf-forum.184993.n3.nabble.com/attribute-units-td1526251.html

Compound types are not easy to use (and maybe not fully supported by all top-level APIs). They may prevent users from using such attributes at all, and eventually, people will continue to assume rather than to specify the unit of a dataset. In my opinion, compound attributes are clearly not a solution.
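
For reference, a compound attribute holding a value and a unit might look like the following in h5py. This is a rough sketch, not a tested recommendation: the attribute name "temperature", the field names "value" and "unit", and the fixed 8-byte string length are all assumptions made for illustration.

```python
import h5py
import numpy as np

# Compound type: one double for the value, one fixed-length byte string
# for the unit. The field names and the string length are assumptions.
quantity_t = np.dtype([("value", "f8"), ("unit", "S8")])

# In-memory file: nothing is written to disk. The file name is arbitrary.
with h5py.File("compound_example.h5", "w", driver="core", backing_store=False) as f:
    params = f.create_group("parameters")  # hypothetical group name

    # A 0-dimensional structured array becomes a scalar compound attribute.
    params.attrs["temperature"] = np.array((300.0, b"K"), dtype=quantity_t)

    # Reading returns a NumPy record; fields are accessed by name.
    temperature = params.attrs["temperature"]
    value = float(temperature["value"])
    unit = temperature["unit"].decode("ascii")
```

Even in this small case the reader has to know the field names and decode the byte string by hand, which illustrates the usability concern raised above.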

A global solution would be to add unit attributes to the h5md group, one for each of the seven base dimensions: length, time, mass, temperature, electric current, amount of substance, and luminous intensity:
http://en.wikipedia.org/wiki/Si_units#Base_units
But this seems to be very restrictive and it requires some physics knowledge to reconstruct the derived units. Further, it would break modularity as datasets would not be completely independent of each other.
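
To make the reconstruction step concrete, here is a small sketch of what a reader would have to do with such global base units. The unit strings, the dictionary layout, and the helper derived_unit are hypothetical; nothing like this is part of the specification.

```python
# Hypothetical base units, as they might be read from attributes of the
# h5md group. Only three of the seven base dimensions are shown.
base_units = {"length": "nm", "time": "ps", "mass": "u"}


def derived_unit(exponents):
    """Compose a unit string from base-dimension exponents.

    `exponents` maps a base dimension to its integer power,
    e.g. {"length": 1, "time": -1} for a velocity.
    """
    parts = []
    for dimension, power in exponents.items():
        if power == 0:
            continue  # this dimension does not contribute
        symbol = base_units[dimension]
        parts.append(symbol if power == 1 else f"{symbol}^{power}")
    return " ".join(parts)


# The reader must know the dimensional formula of every quantity.
velocity_unit = derived_unit({"length": 1, "time": -1})
energy_unit = derived_unit({"mass": 1, "length": 2, "time": -2})
```

The exponents for each dataset are not stored anywhere in this scheme, so they have to come from the reader's own physics knowledge.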

In conclusion, I lean towards Peter's solution #1: we leave the optional unit attribute as it is, but state in the general section that dimensionful, time-independent data have to be stored as datasets _if_ the unit attribute is needed. In order to avoid too many possibilities for the box, we may attach the "unit" attribute to the "box" group itself rather than to edges/offset.
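
The box variant could look roughly like this in h5py. The group name "box" and the attribute names "edges", "offset", and "unit" follow the wording above; the file name, the values, and the unit string "nm" are assumptions.

```python
import h5py
import numpy as np

# In-memory file: nothing is written to disk. The file name is arbitrary.
with h5py.File("box_unit_example.h5", "w", driver="core", backing_store=False) as f:
    box = f.create_group("box")

    # Time-independent box: edges and offset remain plain attributes.
    box.attrs["edges"] = np.array([10.0, 10.0, 20.0])
    box.attrs["offset"] = np.zeros(3)

    # A single unit attribute on the group applies to both of them.
    box.attrs["unit"] = "nm"  # assumed unit string

    # Read everything back while the file is still open.
    edges = box.attrs["edges"]
    offset = box.attrs["offset"]
    box_unit = box.attrs["unit"]
```

This works because a group, unlike an attribute, can itself carry attributes, and edges and offset are both lengths and therefore share one unit.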

Regards,

Felix


