librsync 2.3.0
* We have a few functions to do with reading a netint, stashing it
  somewhere, then moving into a different state. Is it worth writing
  generic functions for that, or would it be too confusing?

* Duplicate block handling. Currently duplicate blocks are included in
  the signature, but we only put the first duplicate block in the
  hashtable, so the delta only includes references to the first block.
  This can result in sub-optimal copy commands, breaking single large
  copies with duplicate blocks into multiple copies referencing the
  earlier copy of the block. However, this could also make patching use
  the disk cache more effectively. This solution is probably fine,
  particularly given how small copy instructions are, but there might be
  solutions for improving copy commands for long runs of duplicate
  blocks.

* Optimisations and code cleanups:

  scoop.c: Scoop needs a major refactor. Perhaps the API needs
  tweaking?

  rsync.h: rs_buffers_s and rs_buffers_t should be one typedef?

  * Just how useful is rs_job_drive anyway?

  mdfour.c: This code has a different API to the RSA code in libmd and
  is coupled with librsync in unhealthy ways (trace?). Recommend
  changing to the RSA API?

* Don't use the rs_buffers_t structure.

  There's something confusing about the existence of this structure. In
  part it may be the name. I think people expect that it will be
  something that behaves like a FILE* or C++ stream, and it really does
  not. Also, the structure does not behave as an object: it's really
  just a shorthand for passing values in to the encoding routines, and
  so does not have a lot of identity of its own.

  An alternative might be

      result = rs_job_iter(job,
                           in_buf, &in_len, in_is_ending,
                           out_buf, &out_len);

  where we update the length parameters on return to show how much we
  really consumed. (A sketch of this call follows at the end of this
  item.)

  One technicality here will be to restructure the code so that the
  input buffers are passed down to the scoop/tube functions that need
  them, which are relatively deeply embedded. I guess we could just
  stick them into the job structure, which is becoming a kind of
  catch-all "environment" for poor C programmers.
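  To make the proposed contract concrete, here is a minimal sketch. The
  prototype is hypothetical (the real rs_job_iter() takes a
  rs_buffers_t), and the stub body just copies input to output so the
  length-update convention can be seen end to end: on return, *in_len
  and *out_len say how much was actually consumed and produced.

      /* Hypothetical shape of the proposed interface; not the current
       * librsync API.  The stub merely copies bytes through so the
       * length-update convention can be exercised. */
      #include <stdio.h>
      #include <string.h>

      typedef enum { RS_DONE, RS_BLOCKED } rs_result;
      typedef struct rs_job rs_job_t;     /* opaque, as in the real library */

      static rs_result rs_job_iter(rs_job_t *job,
                                   const char *in_buf, size_t *in_len,
                                   int in_is_ending,
                                   char *out_buf, size_t *out_len)
      {
          (void)job;
          size_t want = *in_len;
          size_t n = want < *out_len ? want : *out_len;
          memcpy(out_buf, in_buf, n);
          *in_len = n;                    /* bytes consumed */
          *out_len = n;                   /* bytes produced */
          /* Finished only when the input is ending and was all consumed. */
          return (in_is_ending && n == want) ? RS_DONE : RS_BLOCKED;
      }

      int main(void)
      {
          const char *src = "some delta input";
          char out[8];
          size_t in_len = strlen(src), out_len = sizeof out;
          rs_result r = rs_job_iter(NULL, src, &in_len, 1, out, &out_len);
          printf("consumed %zu, produced %zu, result %d\n",
                 in_len, out_len, (int)r);
          return 0;
      }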
* Meta-programming

  * Plot lengths of each function

  * Some kind of statistics on delta each day

* Encoding format

  * Include a version in the signature and difference fields

  * Remember to update them if we ever ship a buggy version (nah!) so
    that other parties can know not to trust the encoded data.

* abstract encoding

  In fact, we can vary on several different variables:

  * what signature format are we using

  * what command protocol are we using

  * what search algorithm are we using?

  * what implementation version are we?

  Some are more likely to change than others. We need a chart showing
  which source files depend on which variable.

* Encoding implementation

  * Join up signature commands

* Encoding algorithm

  * Self-referential copy commands

    Suppose we have a file with repeating blocks. The gdiff format
    allows for COPY commands to extend into the *output* file so that
    they can easily point this out. By doing this, they get compression
    as well as differencing.

    It'd be pretty simple to implement this, I think: as we produce
    output, we'd also generate checksums (using the search block size)
    and add them to the sum set. Then matches will fall out
    automatically, although we might have to specially allow for short
    blocks.

    However, I don't see many files which have repeated 1kB chunks, so
    I don't know if it would be worthwhile.

  * Extended files

    Suppose the new file just has data added to the end. At the moment,
    we'll match everything but the last block of the old file. It won't
    match, because at the moment the search block size is only reduced
    at the end of the *new* file. This is a little inefficient, because
    ideally we'd know to look for the last block using the shortened
    length.

    This is a little hard to implement, though perhaps not impossible.
    The current rolling search algorithm can only look for one block
    size at any time. Can we do better? Can we look for all block
    lengths that could match anything?

    Remember also that at the moment we don't send the block length in
    the signature; it's implied by the length of the new block that it
    matches. This is kind of cute, and importantly helps reduce the
    length of the signature.

  * State-machine searching

    Building a state machine from a regular expression is a brilliant
    idea. (I think *The Practice of Programming* walks through the
    construction of this at a fairly simple level.)

    In particular, we can search for any of a large number of
    alternatives in a very efficient way, with much less effort than it
    would take to search for each one the hard way. Remember also the
    string-searching algorithms and how much time they can take.

    I wonder if we can use similar principles here rather than the
    current simple rolling-sum mechanism? Could it let us match
    variable-length signatures?

* Support gzip compression of the difference stream. Does this belong
  here, or should it be in the client, with librsync just providing an
  interface that lets it plug in cleanly?

  I think if we're going to just do plain gzip, rather than rsync-gzip,
  then it might as well be external.

* rsync-gzip: preload with the omitted text so as to get better
  compression. Abo thinks this gets significantly better compression.
  On the other hand, we would have to import and maintain our own zlib
  fork, at least until we can persuade the upstream to take the
  necessary patch. Can that be done?

  abo says:

    It does get better compression, but at a price. I actually think
    that getting the code to a point where a feature like this can be
    easily added or removed is more important than the feature itself.
    Having generic pre and post processing layers for hit/miss data
    would be useful. I would not like to see it added at all if it
    tangled and complicated the code.

    It also doesn't require a modified zlib... pysync uses the standard
    zlib to do it by compressing the data, then throwing it away. I
    don't know how much benefit the rsync modifications to zlib
    actually are, but if I was implementing it I would stick to a stock
    zlib until it proved significantly better to go with the fork.
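  One way to approximate the preload with a stock zlib is its dictionary
  interface. The sketch below is an illustration of the idea only, not
  pysync's or rsync's actual mechanism, and the helper and its parameter
  names are made up: before compressing a run of literal ("miss") data,
  prime deflate with the text that the preceding copy command omitted;
  the receiver feeds the same bytes to inflateSetDictionary() when
  inflate() asks for a dictionary.

      /* Illustration only: prime stock zlib with the omitted text before
       * compressing a literal run.  Error handling is elided. */
      #include <string.h>
      #include <zlib.h>

      static size_t compress_literal_run(const unsigned char *omitted,
                                         size_t omitted_len,
                                         const unsigned char *literal,
                                         size_t literal_len,
                                         unsigned char *out, size_t out_len)
      {
          z_stream strm;
          memset(&strm, 0, sizeof strm);
          deflateInit(&strm, Z_DEFAULT_COMPRESSION);

          /* History now contains the text the delta skipped over, so the
           * literal data can back-reference into it. */
          deflateSetDictionary(&strm, omitted, (uInt)omitted_len);

          strm.next_in = (Bytef *)literal;
          strm.avail_in = (uInt)literal_len;
          strm.next_out = out;
          strm.avail_out = (uInt)out_len;
          deflate(&strm, Z_FINISH);

          size_t produced = out_len - strm.avail_out;
          deflateEnd(&strm);
          return produced;
      }

  Restarting deflate per run like this throws away history between runs,
  which is presumably part of what the modified zlib avoids, so it would
  need measuring before choosing either route.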
* Licensing

  Will the GNU Lesser GPL work? Specifically, will it be a problem in
  distributing this with Mozilla or Apache?

* Checksums

  * Do we really need to require that signatures arrive after the data
    they describe? Does it make sense in HTTP to resume an interrupted
    transfer?

    I hope we can do this. If we can't, however, then we should relax
    this constraint and allow signatures to arrive before the data they
    describe. (Really? Do we care?)

  * Allow variable-length checksums in the signature; the signature
    will have to describe the length of the sums and we must compare
    them taking this into account.

* Testing

  * Just more testing in general.

  * Test broken pipes and that IO errors are handled properly.

  * Test files >2GB, >4GB. Presumably these must be done in streams so
    that the disk requirements to run the test suite are not too
    ridiculous. I wonder if it will take too long to run these tests?
    Probably, but perhaps we can afford to run just one carefully-chosen
    test.

  * Fuzz instruction streams. <https://code.google.com/p/american-fuzzy-lop/>?

  * Generate random data; do random mutations.

  * Try different block lengths.

  * Tests should fail if they can't find their inputs, or have zero
    inputs: at present they tend to succeed by default.

  * Test varying strong-sum inputs: default, short, long.

* Security audit

  * If this code were to read differences or sums from random machines
    on the network, then it's a security boundary. Make sure that
    corrupt input data can't make the program crash or misbehave.

* Long files

  * How do we handle the large signatures required to support large
    files? In particular, how do we choose an appropriate block size
    when the length is unknown? Perhaps we should allow a way for the
    signature to scale up as it grows. (A sketch of one block-size
    heuristic follows this list.)

  * Perhaps make extracted signatures still be wrapped in commands.
    What would this lead to?

    * We'd know how much signature data we expect to read, rather than
      requiring it to be terminated by the caller.
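  One heuristic for the block-size question above, assuming the old
  file's length is known: let the block length grow roughly with the
  square root of the file size, so the signature and the per-block delta
  overhead grow together rather than one dominating. The constants and
  rounding below are arbitrary illustration values, not librsync
  defaults.

      /* Sketch of a square-root block-size heuristic.  The clamps and
       * the rounding step are illustrative, not librsync's defaults. */
      #include <math.h>
      #include <stddef.h>

      #define MIN_BLOCK_LEN ((size_t)256)
      #define MAX_BLOCK_LEN ((size_t)(64 * 1024))

      static size_t choose_block_len(unsigned long long old_file_len)
      {
          size_t len = (size_t)sqrt((double)old_file_len);
          if (len < MIN_BLOCK_LEN)
              len = MIN_BLOCK_LEN;
          if (len > MAX_BLOCK_LEN)
              len = MAX_BLOCK_LEN;
          return len & ~(size_t)7;      /* round down to a multiple of 8 */
      }

  The harder part is the streaming case where the old file's length is
  not known up front; there something like the "scale up as it grows"
  idea above would still be needed.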