Re: One vs many directories

emacs-orgmode

From:	Jean Louis
Subject:	Re: One vs many directories
Date:	Mon, 23 Nov 2020 16:17:30 +0300
User-agent:	Mutt/2.0 (3d08634) (2020-11-07)
* Texas Cyberthal <texas.cyberthal@gmail.com> [2020-11-23 12:51]:
> Hi Dr. Arne,
> 
> > The only part that hits performance limits is the agenda.
> 
> Well, IIRC your Org Textmind is much smaller than mine.
> 
> > My current guess is that the agenda is slow because it has to parse all my
> > 7500 clock entries, and it has to check the Todo states of around 1200
> > headings.
> 
> Ouch.  I'd rather keep a "ramble log" so I can reconstruct an exactly
> honest time accounting, with discounts for partial attention, without
> worrying about fiddly clockin/outs.  At least when working from home.
> If clocking into a work site, that's different, because one can
> reasonably bill for the entire time, with minimal clock toggling.
> 

> > Did you check against filesystem limits? At 10k entries in a
> > directory typical filesystems start becoming slow. That's the main
> > reason I see for adding hierarchies.

From the ext4 manual:

 dir_index
  Use hashed  b-trees to speed  up name lookups  in large
  directories.   This feature  is supported  by ext3  and
  ext4 file systems, and is ignored by ext2 file systems.

 dir_nlink
  This ext4 feature allows more than 65000 subdirectories
  per directory.

I think that file systems should have no such limits and should stay
fast regardless of the number of entries. I have ~/Maildir with over
50000 subdirectories; direct access to any one of them is easy and
fast, while listing the directory takes some time.

If a file system does not allow fast access, it is time to replace it
with one that does.

Now I wonder if HAMMER in DragonFly BSD is also slow with 50000
directories.

My PostgreSQL database is not huge: when packed it is about 50 MB, and
on the file system it takes 810 MB.

Selecting the 2469 contacts that belong to a certain group, out of
204048 contacts, usually gives no feeling of delay; to a human it
looks instant.

My Org work is on a meta-level, so my truly important headings or
subtree names are in the database. Subtrees have their various
properties: I can place any tags there, like TODO, or designate the
type of TODO. My work is intertwined with text and mostly with Org
mode, but I could use any kind of MIME type or any kind of Emacs mode.
Some nodes are on the file system while some are in the database.

Nodes within a subtree are hyperdocuments; they are all linkable and
may or may not be on the file system.

Everything is together in one tree, and that does not matter because
access to the nodes does not necessarily go through the tree. There
are 19197 nodes. Finding the 76 that are tagged with TODO gives me no
visible delay, definitely not even 0.2 seconds. When I press enter the
result is right there.

From the system I am using personally, I am thinking that Org mode
could get its own database connection, so that headings and properties
are managed fully in the database while text is managed in files. It
seems very possible.

The only thing that would need to be added to Org in that case is some
heading tag that uniquely designates where in the database that
heading is managed. It could be displayed very lightly on the screen
and would not be exported by default.

Something like

*** TODO Heading                                     :ID-123:

That would be all. All other meta data belonging to the heading could
be managed in the database. If a heading is deleted in the file, it
need not be deleted in the database. Text belonging to the heading
could be managed in the text file, properties in the database. It can
be a simple database such as GDBM if that is fast enough.

Meta data for the heading could be updated automatically from time to
time.

The user could easily decide whether to show the properties in the Org
file or not. It does not matter much as long as the :ID-123: tag is
there.

All things like tags, properties, clock-in and clock-out, schedule,
deadlines, custom_id and everything else that is heading meta data
could be managed in the database. It could be copied into new
headings. Creating a heading like this:

*** TitleRET

would automatically invoke creation of heading 124 in the database and it would
appear as:

*** Title                                          :ID-124:

From there on the user would do everything as usual in Org mode, with
the difference that properties would not really be in the Org file;
they would be displayed on the fly in their up-to-date form. Any
properties, and a plethora of other new properties, could be
included.
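
As a rough sketch of that idea, and not a description of any existing
Org feature: the Python below scans the lines of an Org file, registers
every heading that has no :ID-n: tag in a database, and appends the tag.
SQLite stands in here for PostgreSQL or GDBM, and the table layout, the
function name sync_headings and the single-space tag placement are all
assumptions made for illustration.

```python
import re
import sqlite3

# A heading is one or more stars, a space, the title, and optionally
# a trailing :ID-n: tag (the tag format proposed in this message).
HEADING = re.compile(r"^(\*+) (.*?)\s*(?::ID-(\d+):)?\s*$")


def sync_headings(conn, path, lines):
    """Register headings in the database and return the tagged lines.

    Hypothetical helper: headings without an :ID-n: tag get a new
    database row and the matching tag; tagged headings only have their
    file, line and title refreshed in the database.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS headings ("
        "id INTEGER PRIMARY KEY, file TEXT, line INTEGER, title TEXT)"
    )
    out = []
    for lineno, line in enumerate(lines, start=1):
        match = HEADING.match(line)
        if match is None:
            out.append(line)  # body text stays in the file untouched
            continue
        stars, title, heading_id = match.groups()
        if heading_id is None:
            cur = conn.execute(
                "INSERT INTO headings (file, line, title) VALUES (?, ?, ?)",
                (path, lineno, title),
            )
            heading_id = cur.lastrowid  # the database hands out the ID
        else:
            conn.execute(
                "INSERT OR REPLACE INTO headings (id, file, line, title) "
                "VALUES (?, ?, ?, ?)",
                (int(heading_id), path, lineno, title),
            )
        out.append(f"{stars} {title} :ID-{heading_id}:")
    conn.commit()
    return out
```

Calling sync_headings when a file is saved or opened would keep the
file and line of each heading current in the database. In this sketch
the TODO keyword is simply stored as part of the title.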

On saving or opening the Org file, the system would automatically
take care of the following:

- Whether headings are in the right file: if a file changed its place,
  that would be automatically updated in the database.

- The heading ID would always remain unique no matter what, so users
  linking to any heading would not need to worry about the title
  staying the same. The unique ID that links to a heading would
  basically link to the database entry. Opening the link would ask the
  database where the entry is located and open the proper Org file at
  the proper location without parsing the Org file in the usual
  manner. The Org file would then remain much more plain text than it
  is now.

- All the parsing, searching and indexing would be solved
  automatically, and human-readable SQL queries could easily be
  customized by the user. Suddenly there would be much less commotion
  in the work. Org files would look much more human-readable than they
  do now.
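
To make the link lookup and the customizable queries a bit more
concrete, here is a minimal sketch, again with SQLite standing in for
whatever database would really be used. The headings table, its columns
and the function names are hypothetical; nothing like this exists in Org
itself.

```python
import sqlite3


def make_index(conn):
    """Create a hypothetical headings table with an index on the TODO state."""
    conn.executescript(
        """
        CREATE TABLE headings (
            id INTEGER PRIMARY KEY,
            file TEXT,
            line INTEGER,
            title TEXT,
            todo TEXT
        );
        CREATE INDEX headings_todo ON headings (todo);
        """
    )


def locate(conn, heading_id):
    """Return (file, line) for a heading ID, or None if it is unknown."""
    return conn.execute(
        "SELECT file, line FROM headings WHERE id = ?", (heading_id,)
    ).fetchone()


def with_todo(conn, state="TODO"):
    """Return (id, title) of all headings in the given TODO state."""
    return conn.execute(
        "SELECT id, title FROM headings WHERE todo = ? ORDER BY id", (state,)
    ).fetchall()
```

Following a link to ID 123 then becomes one indexed lookup with
locate(conn, 123), after which the editor only has to open the returned
file at the returned line. No Org file is parsed to find the target.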

> 10k entries in a directory sounds inhumanely unergonomic.  I guess my
> biggest flat name directory might eventually reach that size?  In
> which case I could just split it in the middle of the alphabet, or
> similar solution.

For example by first letters, like:

~/Maildir/a/d/a/adam@example.com

Such sorting of files should be automated: you would invoke a command
that sorts files that way, and the same command could also give quick
access to those files.
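
A small sketch of such a path scheme in Python, assuming that the
shard directories are simply the first three characters of the
lowercased name. The function name sharded_path and the "_" padding for
short names are made up for this example.

```python
from pathlib import PurePosixPath


def sharded_path(root, name, depth=3):
    """Return root/<c1>/<c2>/<c3>/name from the first `depth` characters.

    Names shorter than `depth` are padded with "_" so that every entry
    sits at the same directory depth.
    """
    key = name.lower()
    shards = [key[i] if i < len(key) else "_" for i in range(depth)]
    return PurePosixPath(root, *shards, name)


print(sharded_path("~/Maildir", "adam@example.com"))
# prints ~/Maildir/a/d/a/adam@example.com
```

Because the path is computed from the name alone, the same function
serves both for sorting a file into place and for finding it again.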

I have a command that I often use, mkdatedir, which makes a directory
for the current date.
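
The real mkdatedir is not shown here, so the following is only a guess
at the general shape of such a command, written in Python. The ISO date
format for the directory name is an assumption.

```python
from datetime import date
from pathlib import Path


def mkdatedir(base, day=None):
    """Create base/YYYY-MM-DD for the given day (today by default).

    Returns the directory path; calling it again for the same day is
    harmless because an existing directory is left as it is.
    """
    day = day or date.today()
    target = Path(base) / day.isoformat()
    target.mkdir(parents=True, exist_ok=True)
    return target
```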

If I wish to make a database note for the day, the command today-note would 
make sure there is:

- Year 2020 (formatted how I customize it)
  - November (also formatted by custom)
    - 2020-11-23
      And entry is automatically opened for the note.

The system helps me quickly locate the note that relates to the day.
But I can put multiple notes under the same date, and I can also have
the same title for those multiple notes. This is because each note has
its unique ID.

I do not know how Org handles multiple identical headings when linking
to them. By default it does not:

[[Heading][Heading]]

* Heading

  Text here
  
* Heading


  More text here. But if I wish to link here I need to do a hack.

To me and my thinking that is not really logical. There should always
be a unique ID for each heading. My mind is not comforted by the Org
system in that sense. And I should not have to think about the unique
ID, nor should I be writing links like [[Something][Here]] by hand, as
they should be constructed automatically.

Myself, I would like to move the cursor to the second Heading and
capture the link to that heading. I would kill [[Heading][Heading]]
into a special memory for those links. Then I could go to any other
place in the Org file and insert it there without thinking about how
the link looks or constructing it myself, as it already exists in
front of me.

Constructing links by hand is fine for those which are external.

Headings of Org files could be managed by the database in the
background. Then all that distributed or sparse meta information
(mess) disappears.

What people are now trying to do with Org files is manage a database.
Only the entries of that database are pretty much disconnected from
each other, vague and in unknown positions, and Org algorithms then
try to manage everything that is built into all SQL databases anyway.
The mess grows over time.

> A 10k entry directory is getting into enterprise territory, and I'm
> sure enterprise has tech tricks that become worthwhile at that scale.

I will try the options dir_index and dir_nlink to see if my 50000+
directory becomes somewhat faster. Direct access to a subdirectory is
always very fast. I almost never run ls there, nor do I enter any such
directory manually. They store emails, so I just press one key in
mutt; that key extracts the current email address, such as
person@example.com, and opens ~/Maildir/person@example.com, one among
50000. It is accessed by wanting to see the previous conversation with
the contact, not by knowing the directory name or email address; the
computer does that. It is a simple system I have used for years and it
is blazing fast.
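
The mutt key binding itself is not shown, so this is only an
illustration of the lookup step in Python: take the sender address from
a message and derive the per-contact directory from it. The function
name contact_dir and the lowercasing of the address are assumptions.

```python
from email import message_from_string
from email.utils import parseaddr
from pathlib import Path


def contact_dir(raw_message, root="~/Maildir"):
    """Map the From: address of a raw message to root/<address>.

    Apply Path.expanduser() to the result before opening it if root
    starts with "~".
    """
    msg = message_from_string(raw_message)
    _name, address = parseaddr(msg.get("From", ""))
    if not address or "/" in address:
        raise ValueError("no usable sender address found")
    return Path(root) / address.lower()
```

The directory is reached by one direct path lookup, which is why the
number of sibling directories hardly matters for this kind of access.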

> > There are scaling problems in every direction: Too many files per
> > directory, too large files, too much content per heading, too many
> > headings.

Listing more than 200,000 contacts does take some time, but getting
the list from the database is so much faster than ls in ~/Maildir with
more than 50000 entries or subdirectories. I can relate to that. And I
still think that file systems should handle any number of entries.

> There are scaling problems from too much deep tree nesting, namely too
> much fiddly ambiguous manual refiling.  Solution is flat "solid name"
> directories just below feasible 10 Bins.  Work fine.

I have tried your solution and could not find a mental concept in it
that relates to my thinking. But I do agree that such a solution could
help other people.

For images I have a command like `sort-images.lisp' that just sorts
images by their embedded dates. Many times I even sort downloads per
day.
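
The per-day sorting of downloads could look roughly like the Python
below. This is not `sort-images.lisp': it uses the file modification
time instead of dates embedded in the image, and the function name
sort_by_day is made up.

```python
import shutil
from datetime import date
from pathlib import Path


def sort_by_day(folder):
    """Move each regular file in folder into folder/YYYY-MM-DD/.

    The day is taken from the modification time in local time. Files
    whose target name already exists are left where they are.
    """
    folder = Path(folder)
    moved = []
    # sorted() takes a snapshot, so the day directories created below
    # are not visited by this loop.
    for item in sorted(folder.iterdir()):
        if not item.is_file():
            continue
        day = date.fromtimestamp(item.stat().st_mtime).isoformat()
        target_dir = folder / day
        target_dir.mkdir(exist_ok=True)
        target = target_dir / item.name
        if target.exists():
            continue  # never overwrite an already sorted file
        shutil.move(str(item), str(target))
        moved.append(target)
    return moved
```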

Memacs tries to solve about the same problem.

Memacs
https://github.com/novoid/Memacs

That hyperlink I have selected among 20000 other hyperlinks. I could
as well send you the notes or annotation related to the hyperlink. I
have not written the hyperlink myself; all I did was open HyperScope,
invoke completion, and press W on the link on screen, and it copied
itself into this email. It was blazing fast, as I accessed it by
thinking Memex. Not Memacs, but Memex. Memacs was just next to it. "By
thinking" still means that I had to enter some words that I think of.
Memory is involved in that process of thinking and accessing.

You mentioned humans know many words. If we observe the process of
knowing words, how do we access them? This time really by thinking.
But from what I have heard, we access them mostly by association or by
direct access. We see the flower and the word is just there.

Do we think of a tree of knowledge first? I do not think so. And there
are memory systems that DO think of a plethora of various things and
increase human memory capabilities. That is called mnemonics.
Mnemonics is based mostly on associations. It becomes possible to
memorize a shuffled pack of 52 cards within 20 minutes, to reliably
know which card is at which position, and to reproduce the full series
of cards. Mnemonic methods help humans do such feats. Everybody can do
it.

Compare now the systems of the human mind:

- direct access by direct association: I think of Memex, but I know
  there is something similar in Emacs, I write Memex and I get Memacs,
  then I give the reference to you.

- and there is also the system of thinking where I can locate in my
  mind a reference to Memacs even by its number or ID, because that
  could be the mnemonic by which I think about it.

How humans think is nowhere defined and is vague. Humans think how
they think, and there may be as many versions as there are humans.

Computers should no longer be delivered with only one built-in
paradigm such as the file system. There should be at least several:

- file system

- meta database approach, which involves a little more curation than
  just making a file name.

- subject or tag based approach

- Dewey decimal approach or other similar

- 10 Bins, etc.

Then the user could decide to use this or that approach. Having had
only file managers for decades is really boring. It does not advance
computing. To say that we have hierarchical file systems by default
and nothing else shows how under-developed we are.

Doug Engelbart already envisioned how files could be stored,
accessed, hyperlinked and referenced, and we do not use them in that
sense today, after how many years? Maybe 40 years. Computer makers and
OS makers do not really help us. There are visionaries, but we do not
get file systems that help us access files by association or thinking,
so we have to upgrade our tools for tasks that should be built in.

Org as a concept was already invented by Doug Engelbart decades ago,
and Org still does not have features that I would like it to have. For
example finely grained unique ID numbers that can also relate to
paragraphs or sets of paragraphs, unique or static sorting of a file
repository, wide group collaboration and sharing, and other concepts.
Hyperlinking was already sophisticated back then.

Highlights of the 1968 "Mother of All Demos"
https://www.dougengelbart.org/content/view/276/000/