diff-highlight
==============

Line-oriented diffs are great for reviewing code, because for most
hunks, you want to see the old and the new segments of code next to each
other. Sometimes, though, when an old line and a new line are very
similar, it's hard to immediately see the difference.

You can use "--color-words" to highlight only the changed portions of
lines. However, this can often be hard to read for code, as it loses
the line structure, and you end up with oddly formatted bits.

Instead, this script post-processes the line-oriented diff, finds pairs
of lines, and highlights the differing segments.  It's currently very
simple and stupid about doing these tasks. In particular:

  1. It will only highlight hunks in which the number of removed and
     added lines is the same, and it will pair lines within the hunk by
     position (so the first removed line is compared to the first added
     line, and so forth). This is simple and tends to work well in
     practice. More complex changes don't highlight well, so the "same
     number of removed and added lines" restriction tends to exclude
     them. Even when we do try to highlight them, they usually end up
     unhighlighted because of our "don't highlight if the whole line
     would be highlighted" rule.

  2. It will find the common prefix and suffix of two lines, and
     consider everything in the middle to be "different". It could
     instead do a real diff of the characters between the two lines and
     find common subsequences. However, the point of the highlight is to
     call attention to a certain area. Even if some small subset of the
     highlighted area actually didn't change, that's OK. In practice it
     ends up being more readable to just have a single blob on the line
     showing the interesting bit.

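As a rough illustration of the two rules above, here is a minimal Python
sketch. It is not the script's actual code: the function names
(`highlight_pair`, `highlight_hunk`) and the "-{}" / "+{}" markers are
assumptions made for this example, and it works on bare line content
without the leading diff marker or any color codes.

```python
def common_prefix_len(a, b):
    """Length of the longest common prefix of two sequences."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def mark(line, prefix, suffix, sign):
    """Wrap the part of `line` between prefix and suffix in sign{...}."""
    end = len(line) - suffix
    middle = line[prefix:end]
    if not middle:
        return line  # nothing differs on this side, leave it alone
    return line[:prefix] + sign + "{" + middle + "}" + line[end:]


def highlight_pair(old, new):
    """Rule 2: everything between the common prefix and suffix is 'different'."""
    prefix = common_prefix_len(old, new)
    # Measure the suffix on the remainders so it cannot overlap the prefix.
    suffix = common_prefix_len(old[prefix:][::-1], new[prefix:][::-1])
    if prefix == 0 and suffix == 0:
        # The whole line would be highlighted, so leave both lines alone.
        return old, new
    return mark(old, prefix, suffix, "-"), mark(new, prefix, suffix, "+")


def highlight_hunk(removed, added):
    """Rule 1: pair lines by position, only when the counts match."""
    if len(removed) != len(added):
        return list(removed), list(added)
    pairs = [highlight_pair(o, n) for o, n in zip(removed, added)]
    return [o for o, _ in pairs], [n for _, n in pairs]
```

For example, highlight_pair("foo(buf, size);", "foo(obj->buf, obj->size);")
marks the new line as "foo(+{obj->buf, obj->}size);".
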
The goal of the script is therefore not to be exact about highlighting
changes, but to call attention to areas of interest without being
visually distracting.  Non-diff lines and existing diff coloration are
preserved; the intent is that the output should look exactly the same as
the input, except for the occasional highlight.

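One way to keep the output identical apart from the highlights is to
buffer only runs of removed and added lines and copy everything else
straight through. The Python sketch below assumes that structure; it is
not taken from the script itself. The names `kind` and `filter_diff` and
the `highlight_hunk` callback are hypothetical, and file headers such as
"--- a/file" are not special-cased.

```python
import re

# ANSI color escape sequences, such as those produced by `git log --color`.
COLOR = re.compile(r"\x1b\[[0-9;]*m")


def kind(line):
    """Classify a line by its first visible character, ignoring color codes."""
    visible = COLOR.sub("", line)
    if visible.startswith("-"):
        return "-"
    if visible.startswith("+"):
        return "+"
    return " "


def filter_diff(lines, highlight_hunk=lambda removed, added: (removed, added)):
    """Pass lines through, handing each removed/added run to highlight_hunk."""
    out, removed, added = [], [], []

    def flush():
        if removed and added:
            new_removed, new_added = highlight_hunk(list(removed), list(added))
        else:
            # Pure additions or pure removals have nothing to pair with.
            new_removed, new_added = removed, added
        out.extend(new_removed)
        out.extend(new_added)
        removed.clear()
        added.clear()

    for line in lines:
        k = kind(line)
        if k == "-":
            if added:
                flush()  # a removal after additions starts a new run
            removed.append(line)
        elif k == "+":
            added.append(line)
        else:
            flush()
            out.append(line)  # context and non-diff lines pass through untouched
    flush()
    return out
```

With the default callback, which changes nothing, filter_diff() returns
its input unchanged, color codes included.
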
Use
---

You can try out the diff-highlight program with:

---------------------------------------------
git log -p --color | /path/to/diff-highlight
---------------------------------------------

If you want to use it all the time, drop it in your $PATH and put the
following in your git configuration:

---------------------------------------------
[pager]
        log = diff-highlight | less
        show = diff-highlight | less
        diff = diff-highlight | less
---------------------------------------------

Bugs
----

Because diff-highlight relies on heuristics to guess which parts of
changes are important, there are some cases where the highlighting is
more distracting than useful. Fortunately, these cases are rare in
practice, and when they do occur, the worst case is simply a little
extra highlighting. This section documents some cases known to be
sub-optimal, in case somebody feels like working on improving the
heuristics.

1. Two changes on the same line get highlighted in a blob. For example,
   highlighting:

----------------------------------------------
-foo(buf, size);
+foo(obj->buf, obj->size);
----------------------------------------------

   yields (where the inside of "+{}" would be highlighted):

----------------------------------------------
-foo(buf, size);
+foo(+{obj->buf, obj->}size);
----------------------------------------------

   whereas a more semantically meaningful output would be:

----------------------------------------------
-foo(buf, size);
+foo(+{obj->}buf, +{obj->}size);
----------------------------------------------

   Note that doing this right would probably involve a set of
   content-specific boundary patterns, similar to word-diff. Otherwise
   you get junk like:

-----------------------------------------------------
-this line has some -{i}nt-{ere}sti-{ng} text on it
+this line has some +{fa}nt+{a}sti+{c} text on it
-----------------------------------------------------

   which is less readable than the current output.

2. The multi-line matching assumes that lines in the pre- and post-image
   match by position. This is often the case, but can be fooled when a
   line is removed from the top and a new one added at the bottom (or
   vice versa). Unless the lines in the middle are also changed, diffs
   will show this as two hunks, and it will not get highlighted at all
   (which is good). But if the lines in the middle are changed, the
   highlighting can be misleading. Here's a pathological case:

-----------------------------------------------------
-one
-two
-three
-four
+two 2
+three 3
+four 4
+five 5
-----------------------------------------------------

   which gets highlighted as:

-----------------------------------------------------
-one
-t-{wo}
-three
-f-{our}
+two 2
+t+{hree 3}
+four 4
+f+{ive 5}
-----------------------------------------------------

   because it matches "two" to "three 3", and so forth. It would be
   nicer as:

-----------------------------------------------------
-one
-two
-three
-four
+two +{2}
+three +{3}
+four +{4}
+five 5
-----------------------------------------------------

   which would probably involve pre-matching the lines into pairs
   according to some heuristic.
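
The pre-matching mentioned in the last item could take many forms. As
one hypothetical heuristic (not something the script does), the Python
sketch below pairs each removed line with the most similar added line
that comes after the previous match, using difflib.SequenceMatcher from
the standard library. The `prematch` name and the 0.5 threshold are
arbitrary choices for illustration.

```python
from difflib import SequenceMatcher


def prematch(removed, added, threshold=0.5):
    """Pair each removed line with the most similar later added line.

    Returns (removed_index, added_index) pairs in increasing order.
    Lines with no candidate scoring above `threshold` stay unpaired.
    """
    pairs = []
    start = 0  # only consider added lines after the previous match
    for i, old in enumerate(removed):
        best_j, best_score = None, threshold
        for j in range(start, len(added)):
            score = SequenceMatcher(None, old, added[j]).ratio()
            if score > best_score:
                best_j, best_score = j, score
        if best_j is not None:
            pairs.append((i, best_j))
            start = best_j + 1
    return pairs
```

On the pathological case above, this pairs "two" with "two 2", "three"
with "three 3", and "four" with "four 4", leaving "one" and "five 5"
unpaired.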