|
From: | Vadim Zeitlin |
Subject: | [lmi] [PATCH] CensusView::Update() optimizations |
Date: | Sat, 28 Feb 2015 02:04:05 +0100 |
Date: | Fri, 27 Feb 2015 15:05:34 +0100 |
Hello, This is another long message because I've been writing it on and off for a couple of months, but it can be summarized like this: 1. Please apply the patch 1 to speed up column updating by a factor of ~4 in the variable width mode. 2. Please apply the patches 2-5 to speed this up by a factor of ~2 in the fixed width mode in addition (i.e. variable width is ~8 times faster). These patches should be completely safe to apply. 3. Please apply the patch 6 if we can assume that all Input cells are homogeneous, i.e. have the same members, as I believe they do. This brings the gains to ~10 times for the fixed width and ~25 times for the variable width case. 4. It is still not clear if Update() is fast enough even after all these changes to be called after every cell change or not. This depends on the typical census size and speed of the machines on which lmi is used. Full details and explanations follow below. On Sun, 02 Nov 2014 01:40:21 +0000 Greg Chicares <address@hidden> wrote: GC> On 2014-11-01 23:57Z, Vadim Zeitlin wrote: GC> > On Sat, 01 Nov 2014 22:05:38 +0000 Greg Chicares <address@hidden> wrote: GC> > GC> > GC> For a given column in the census manager, suppose that all cells GC> > GC> have the same value, except for one differing cell. If we edit GC> > GC> that differing cell and remove that difference, then all cells GC> > GC> are now the same, and the column is no longer to be shown. When GC> > GC> is the display to be thus updated? IOW, when should the function GC> > GC> that updates the display, CensusView::Update(), be called? GC> > GC> > I think the best solution for this would involve doing the work done by GC> > update_class_names() on a separate thread and then requesting that the main GC> > thread updates the columns only if really necessary -- which should be a GC> > relatively rare occurrence. GC> GC> That idea has two parts: GC> GC> (1) Threads: I'm reluctant to go there at this time, fearing a host GC> of problems. GC> GC> (2) Update columns only if necessary: yes, I'd like to do that. The first attached patch does exactly this and, as expected, it does have a significant effect when switching to the "Varying column widths" mode. With MSVC 12 release build I observed a 45% gain for the first switch and 90% gain for the subsequent ones (again, this is not surprising: before, the column widths cache was cleared every time as the columns were recreated, now that it is not cleared any more the subsequent column widths recalculations reuse it and so are significantly faster). Unfortunately the gains are less impressive with gcc 3.4.5: there we gain only ~40% for the first switch and ~75% for the subsequent ones. I suspect this is due to poorer quality of optimization in this ancient version of gcc, as the code executed inside lmi itself seems to take a greater proportion of time compared to the time spent in Windows functions themselves (which is the same for MSVC and gcc, of course), so one can hope that upgrading to a newer gcc version might improve things for it. Nevertheless, this patch doesn't have any drawbacks and brings the time to switch to "Varying widths" mode down from ~5.5s to ~3.4s for the first time and ~1.4s for the subsequent ones, i.e. is pretty noticeable for the user and so should be applied. [As an aside, all the benchmarking were done by taking the second smallest measurement out of 20 runs. The rationale for taking this number instead of the average one is that on a not quite idle machine the time can be only increased by the other processes executing on it and so the smallest time is more representative of the intrinsic speed of the code. And the reason for not taking the smallest one just to exclude any suspicious outliers (but in practice the smallest and second smallest one were always very close or even identical anyhow).] GC> > GC> But there has to be a way to trigger Update() aside from the GC> > GC> triggers documented above, so I asked myself what a reasonably GC> > GC> ingenious user would try. The answer seemed obvious: GC> > GC> http://svn.savannah.nongnu.org/viewvc?view=rev&root=lmi&revision=6016 GC> > GC> > This is indeed ingenious, but I am pretty sure fully automatic solution is GC> > achievable as well. Definitely so if you're prepared to use a background GC> > thread and perhaps even without it. GC> > GC> > Please let me know if it's worth investigating this further, GC> GC> Yes, please. I spent quite some on profiling this with MSVC-specific tools, as they are rather more convenient than gcov and definitely much faster to use, and then optimizing it and checking the results under both MSVC and gcc and I give the brief summary of this long process below. If you're interested in all the exact numbers, I made them available on https://docs.google.com/spreadsheets/d/12bXWDT3SO9hMOLkFjeRRhqLi8gsGXWBUlzR0-ZpmnCM/ The three sets of benchmarks correspond to the MSVC 12 release build of lmi running natively, the same binary running inside a VM and the binary built with gcc 3.4.5 also running inside the VM. The "delta" columns correspond to the gain compared to the previous row, the baseline Update() time in the fixed widths case and the baseline Update() time in the variable widths case respectively. The MSVC VM numbers also have an extra column measuring the slowdown compared to the native version, i.e. the VM overhead. As you can see, it varies a lot: from just 102% (i.e. almost no slowdown at all, maybe because VMware paravirtualizes the Windows text measuring APIs where most of the time is spent in this line?) for the variable width case baseline to ~170% in the other cases (I have no real explanation for this). Finally, the gcc numbers have an extra column measuring the slowdown compared to the MSVC build running inside the same VM. Assuming that more recent g++ versions optimize code comparably to MSVC (and I believe this is roughly speaking the case), this could be seen as an estimation of the potential performance gains from upgrading to gcc4/gcc5. Anyhow, here is the promised "brief" summary: to begin with, I measured the time of Update() execution using the same methodology as above (i.e. second smallest of 20 runs), which went into the two first rows of the spreadsheet above. Then I profiled it and saw that its execution time was completely dominated by update_visible_columns(), which took 98% of its running time, and 99.95% (this is a measurement, not an exaggeration) of its own running time was spent in column_value_varies_across_cells(). The patch from above helps a little with Update() speed even in the fixed widths case, but the gain is small, of order of 6-7%. Looking at the function, there was one obvious micro-optimization to do in it, which was to move an invariant value out of the loop, as this is clearly not done by neither MSVC (which gains 23% outside of VM and 16% inside it after applying this patch) nor gcc (which gains 14%), so the second patch is definitely worth applying too. Unfortunately even with both of these patches applied, Update() still took 355ms in the native MSVC build on my relatively fast machine and ~1100ms for the gcc build inside the VM which is probably much closer to the average client machine performance than my workstation. So even though we gained between 20 and 30%, the function is still much too slow to be called implicitly in response to the user action, e.g. after deleting a cell, as such operations are supposed to not take more than 100ms, per the usual rule of thumb. So I had no choice but to start looking outside of the CensusView code and, notably, at MemberSymbolTable and any_member<> itself. My first change here is the third patch and is not very interesting from performance perspective, but in addition to speeding up the code a little, it also makes it simpler, so I think it should still be applied. The patch just removes a redundant field, which was stored in the any_member<> itself as well as in its content_ field. The next optimization concerns MemberSymbolTable: it uses std::map<> which is the best general purpose container for the mix of insert and lookup operations but not at all the best choice for the use pattern of this class in which the items are inserted in it only once and then looked up many times. The theoretically best container for such use is a sorted vector and this is what the patch #5 does (it relies on the previously posted patch defining a working swap() for any_member, which I include as the patch #4 in this series for completeness). I was pretty upset with the effect of this patch with MSVC where the gain is just 2%, but gcc gains almost 30% with it. This is probably more due to the poor efficiency of its std::map implementation than to anything else but it's still good news and brings the cumulative gain to almost 50% compared to the baseline. But Update() is still too slow. As an aside, I also experimented with C++11 std::unordered_map<> (which I could only test in the MSVC build, of course, as gcc 3.4 doesn't have it). Unlike the patch #5, this was a trivial change, I basically just replaced std::map with std::unordered_map and added an #include line without changing anything else -- but this was enough to gain more than 20%. It turns out that the sorted vector is still better, see below, but this does show that unordered_map<> is a much better choice than map<> for such use pattern. I also have to mention that before writing my own sorted vector, I tried using boost::container::flat_map<> which also resulted in ~20% gain compared with std::map. Unfortunately, flat_map is not in Boost 1.47 used by lmi and even though I could reuse its (header-only) implementation, it relies on the map elements being either copy-assignable or move-assignable and any_member<> is not copy-assignable as discussed at length in the other thread, so I had to use C++11 move assignment operator to make lmi work with it. And because of this reliance on C++11 support I couldn't use this class, finally. But the annoying thing is that it is so much faster than my own sorted vector and I have no idea why. I thought that this could be due to a better memory access locality as flat_map stores the keys and the values together while I store the keys in the (already existing) separate vector, so I tried doing the same thing and also storing the pairs of (name, any_member) in the vector instead of just the latter, but this only made my version a very tiny bit slower. I also saw that flat_map used its own binary search implementation instead of std::lower_bound() so I checked if I could be paying a penalty for using the standard function, but replacing it with the hand written code gained me absolutely nothing. So this remains a mystery and I'll try to return to this later because this 20% discrepancy between the very similar implementations definitely must have some explanation. But for now I'm out of ideas... And we're getting to the end of the story because with the vector-based implementation of MemberSymbolTable I could finally apply the optimization which I couldn't use before but which is clearly very helpful: the loop in column_value_varies_across_cells() currently searches for the member with the given name on each loop iteration, but all cells in Input have the same structure (please correct me if I'm wrong, this assumption is absolutely vital) and so we can compute the member offset only once, outside the loop, now and just reuse it later. This speeds things up by 70-80%, i.e. we get up to 5-fold improvement. And the best thing is that this time the improvement is greatest for the gcc build and brings it almost to the same speed as MSVC. I have no idea why does gcc profit from this so much more than MSVC but then I'm quite convinced that if you don't obtain some completely unexpected results when doing benchmarks, you must be doing something wrong -- so at least it's reassuring from this point of view. The conclusion is that the final version is more than 10 times faster than the original one in the fixed width case and more than 25 times faster in the variable width one, when using gcc. Unfortunately this may still not be quite enough to start calling Update() in response to the user actions as it is still not really under the 100ms barrier and in interactive testing you easily feel the sluggishness with the big census. I don't know if the ~4000 rows in this census are the (realistic) upper limit or more of an average. In the latter case we definitely still can't afford making unconditional Update() calls but in the former it might just pass. Optimizing Update() further is going to be difficult, the execution time is now spread pretty evenly among various simple functions, notably all sorts of comparison operators which risk to be difficult to optimize, e.g. tn_range<> one takes almost 10% of Update() time but it's just a simple comparison. My best hope probably remains to find out what does flat_map do differently to achieve so much better performance. Please let me know if you have any questions about this work or the patches and I hope they can be applied because I think lmi users should appreciate them. Thanks in advance, VZ
0001-Only-recreate-census-view-columns-if-really-necessar.patch
Description: Text document
0002-Micro-optimization-in-CensusView-column_value_varies.patch
Description: Text document
0003-Remove-unused-any_member-object_-field.patch
Description: Text document
0004-Define-global-swap-for-any_member.patch
Description: Text document
0005-Use-a-sorted-vector-instead-of-std-map-in-MemberSymb.patch
Description: Text document
0006-Allow-computing-MemberSymbolTable-member-access-inde.patch
Description: Text document
[Prev in Thread] | Current Thread | [Next in Thread] |