Ok, after much experimentation I've figured out what is going on.

First, why is UFS skewed towards writing to the extreme detriment of reads, while HAMMER is skewed towards reading to the extreme detriment of writes?  In a word: flushing meta-data out in UFS doesn't require as many locks to be held as flushing meta-data out in HAMMER does.

The issue in UFS can be somewhat controlled by an I/O scheduler, but it isn't straightforward due to the way disk drives handle write I/O's versus read I/O's.  Write I/O's tend to get acknowledged instantly by the hard drive up until the point where the hard drive's own ram cache fills up with dirty data, and there is no way to gauge and control the backlog.  One must also ensure that some hardware protocol tags are reserved for reading and some are reserved for writing, so read I/O isn't able to completely stall out write I/O or vice versa.  DragonFly does this in its CAM layer (I don't know about FreeBSD, it is something I added recently).  It's very difficult to control write bandwidth in an I/O scheduler without simulating/calculating probable seek times for random vs linear write I/O.

For HAMMER the problem is that HAMMER's flusher threads are constantly getting stalled out by B-Tree locks being held by the ~100 reader threads (in the blogbench test).  Fixing this in HAMMER cannot be done in the I/O scheduler, because stalling out read I/O's in the I/O scheduler (in order to try to make more bandwidth available for writing) will simply cause the related B-Tree locks to be held even longer and cause write activity to actually go down.  The fix has to be in HAMMER itself.

NOTE: I cannot solve this by giving the flusher's exclusive locks priority over the frontend's shared locks without creating major 3-thread deadlock chains, and using exclusive locks in the readers results in reduced read concurrency.

--

So, I am going to commit some experimental code to HAMMER which tries to manage the locking conflicts between the frontend reader threads and the backend flusher threads.  I am going to do this by creating a pulse-width modulated time-domain multiplexer in HAMMER which tries to 'slot in' reads and writes based on the number of inodes backlogged in the flusher.

Basically the idea of using a PWM is this: You take a fixed period of time, say 1/5 of a second:

    [----------------------------------------------]

You allot a portion of the time slice to the backend flusher and the remainder to the frontend:

    [wwwwwwrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr]   Flusher lightly loaded

    [wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwrrrrrrr]   Flusher heavily loaded

Even though read I/O operations in a heavily loaded system can stall for much more than 1/5 of a second, delaying each read operation a certain number of ticks before it is initiated gives the flusher a chance to win locking conflicts, and thus the flusher is able to gain performance over the frontend reads.
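To make the slotting concrete, here is a minimal userland sketch of the idea, assuming a 1/5-second period and a flusher backlog measured in inodes.  The constants, the backlog scaling, and the function names (write_window_usec, read_slot_delay) are illustrative assumptions only, not the actual HAMMER code:

/*
 * Illustrative sketch of the pulse-width modulated read/write slotting.
 * A fixed period is divided into a write window (sized by the flusher
 * backlog) followed by a read window; a frontend read arriving inside
 * the write window is stalled until the window ends, so the flusher
 * tends to win the B-Tree locking conflicts.
 */
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define PERIOD_USEC     200000  /* fixed 1/5 second time slice */
#define BACKLOG_FULL    1000    /* assumed backlog giving the flusher the whole slice */

/* Microseconds since an arbitrary monotonic epoch. */
static long long
now_usec(void)
{
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ((long long)ts.tv_sec * 1000000 + ts.tv_nsec / 1000);
}

/* Width of the flusher's window, scaled by the inode backlog. */
static long
write_window_usec(int inodes_backlogged)
{
        if (inodes_backlogged > BACKLOG_FULL)
                inodes_backlogged = BACKLOG_FULL;
        return ((long)PERIOD_USEC * inodes_backlogged / BACKLOG_FULL);
}

/*
 * Called by a frontend read before it acquires B-Tree locks.  If we
 * are inside the flusher's slot, delay until the slot ends.
 */
static void
read_slot_delay(int inodes_backlogged)
{
        long offset = now_usec() % PERIOD_USEC; /* position in current period */
        long wwin = write_window_usec(inodes_backlogged);

        if (offset < wwin)
                usleep(wwin - offset);
}

int
main(void)
{
        /* Example: a moderately backlogged flusher delays this "read". */
        read_slot_delay(300);
        printf("read proceeds outside the flusher's slot\n");
        return (0);
}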
--

This change isn't just to help blogbench out.  It also appears to solve some major issues with namecache stalls that occur when HAMMER is heavily write-loaded, and issues with things like vi ':wq' operations (which fsync()) seem to be improved.  My commit message also mentions it helping with 'ls' and 'find', but I think the 'ls' and 'find' issue needs a bit more work.

The effect on the blogbench tests is basically to improve write performance a little at the cost of read performance.  This tradeoff is due to hard drive seek times and is unavoidable.

For blogbench in stage 2 after the system caches are blown out.  Approximate values only.  R articles vs W articles.

                       read    write
    UFS:                600     4000   (freebsd)
    HAMMER BEFORE:    20000       50   (dragonfly)
    HAMMER AFTER:      2500      150   (dragonfly)  <-- this is an improvement
                                                        even though it may not
                                                        seem that way.

As you can see HAMMER still prioritizes reads, and that is precisely what I want to have happen... reads are far more important than writes.  We don't want writes to stall out completely, but neither do we want writes to be able to stall reads out completely.

In the blogbench test one basically has ~100 threads issuing random reads, but the read issued by each thread is for a whole file and is thus linear.  In other words, increasing the write activity by a little decreases the disk bandwidth (due to spindles/seeks) by a lot.

So, Francois, let's see how the stuff I committed works out.  You need to remove the temporary patches I forwarded to you on IRC.  What I committed is the final version.  I dunno if the graphs will look any better since they are so badly skewed towards the pre-system-cache-blowout numbers, but things should run more smoothly.

-Matt