Tweaking diff output
====================
June 2005


Introduction
------------

The diff commands git-diff-index, git-diff-files, and
git-diff-tree can be told to manipulate differences they find
in unconventional ways before showing diff(1) output.  The
manipulation is collectively called "diffcore transformation".
This short note describes what they are and how to use them to
produce diff outputs that are easier to understand than the
conventional kind.


The chain of operation
----------------------

The git-diff-* family works by first comparing two sets of
files:

 - git-diff-index compares contents of a "tree" object and the
   working directory (when '\--cached' flag is not used) or a
   "tree" object and the index file (when '\--cached' flag is
   used);

 - git-diff-files compares contents of the index file and the
   working directory;

 - git-diff-tree compares contents of two "tree" objects.

In all of these cases, the commands themselves compare
corresponding paths in the two sets of files.  The result of
comparison is passed from these commands to what is internally
called "diffcore", in a format similar to what is output when
the -p option is not used.  E.g.

------------------------------------------------
in-place edit  :100644 100644 bcd1234... 0123456... M file0
create         :000000 100644 0000000... 1234567... A file4
delete         :100644 000000 1234567... 0000000... D file5
unmerged       :000000 000000 0000000... 0000000... U file6
------------------------------------------------
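
To make the shape of one such record concrete, here is a minimal
Python sketch that splits a record into its fields.  The names
`FilePair` and `parse_raw_line` are hypothetical and are not part of
git; the sketch also assumes that fields are separated by whitespace
and that pathnames contain no whitespace, which holds for the
examples in this note but not in general.

```python
from typing import NamedTuple, Optional


class FilePair(NamedTuple):
    """One comparison result (illustrative model, not git's struct)."""
    old_mode: str
    new_mode: str
    old_id: str
    new_id: str
    status: str                      # M, A, D, U, or R/C followed by a score
    path: str
    dest_path: Optional[str] = None  # only present for renames and copies


def parse_raw_line(line: str) -> FilePair:
    """Parse a record such as ':100644 100644 bcd1234... 0123456... M file0'."""
    fields = line.strip().lstrip(":").split()
    old_mode, new_mode, old_id, new_id, status, path = fields[:6]
    dest_path = fields[6] if len(fields) > 6 else None
    return FilePair(old_mode, new_mode, old_id, new_id, status, path, dest_path)


pair = parse_raw_line(":000000 100644 0000000... 1234567... A file4")
print(pair.status, pair.path)
```

A record with two pathnames, such as the rename and copy results
shown later in this note, fills in `dest_path` as well.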

The diffcore mechanism is fed a list of such comparison results
(each of which is called a "filepair", although at this point each
of them talks about a single file), and transforms such a list
into another list.  There are currently 6 such transformations:

- diffcore-pathspec
- diffcore-break
- diffcore-rename
- diffcore-merge-broken
- diffcore-pickaxe
- diffcore-order

These are applied in sequence.  The set of filepairs that the
git-diff-\* commands find is used as the input to
diffcore-pathspec, and the output from each transformation is
used as the input to the next one.  The final result is then
passed to the output routine, which generates either diff-raw
format (see the Output format sections of the manual for the
git-diff-\* commands) or diff-patch format.

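
The "list in, list out" chaining described above can be sketched in
a few lines of Python.  This is only an illustration of the data
flow: `run_diffcore` and the two toy transformations are hypothetical
names, a filepair is reduced to a `(status, path)` tuple, and none of
this mirrors how git implements the chain.

```python
from typing import Callable, List, Tuple

FilePair = Tuple[str, str]  # (status, path); a simplified stand-in
Transform = Callable[[List[FilePair]], List[FilePair]]


def run_diffcore(pairs: List[FilePair], transforms: List[Transform]) -> List[FilePair]:
    """Feed the output of each transformation to the next one, in order."""
    for transform in transforms:
        pairs = transform(pairs)
    return pairs


# Two toy transformations, standing in for the real ones.
def only_under_src(pairs: List[FilePair]) -> List[FilePair]:
    return [pair for pair in pairs if pair[1].startswith("src/")]


def sort_by_path(pairs: List[FilePair]) -> List[FilePair]:
    return sorted(pairs, key=lambda pair: pair[1])


result = run_diffcore(
    [("M", "src/b.c"), ("A", "doc/x.txt"), ("D", "src/a.c")],
    [only_under_src, sort_by_path],
)
print(result)  # [('D', 'src/a.c'), ('M', 'src/b.c')]
```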

diffcore-pathspec
-----------------

The first transformation in the chain is diffcore-pathspec, and
is controlled by giving the pathname parameters to the
git-diff-* commands on the command line.  The pathspec is used
to limit the world diff operates in.  It removes the filepairs
outside the specified set of pathnames.

Implementation note.  For performance reasons, git-diff-tree
uses the pathname parameters on the command line to cull the set
of filepairs it feeds the diffcore mechanism itself, and does not
use diffcore-pathspec, but the end result is the same.

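
As a rough illustration of "removing the filepairs outside the
specified set of pathnames", the following Python sketch filters a
list of `(status, path)` tuples.  It assumes that a pathname
parameter selects either that exact path or everything below it as a
directory, and that an empty parameter list selects everything; both
are assumptions made for this sketch, and `matches_pathspec` and
`diffcore_pathspec` are hypothetical function names.

```python
from typing import List, Tuple

FilePair = Tuple[str, str]  # (status, path); a simplified stand-in


def matches_pathspec(path: str, pathspecs: List[str]) -> bool:
    """True if path equals a pathspec or lives under it as a directory."""
    if not pathspecs:
        return True  # assumption: no pathname parameters means no limit
    for spec in pathspecs:
        spec = spec.rstrip("/")
        if path == spec or path.startswith(spec + "/"):
            return True
    return False


def diffcore_pathspec(pairs: List[FilePair], pathspecs: List[str]) -> List[FilePair]:
    """Drop the filepairs whose path is outside every pathspec."""
    return [pair for pair in pairs if matches_pathspec(pair[1], pathspecs)]


pairs = [("M", "Documentation/diffcore.txt"), ("M", "diff.c"), ("A", "Documentationx")]
print(diffcore_pathspec(pairs, ["Documentation"]))
```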

diffcore-break
--------------

The second transformation in the chain is diffcore-break, and is
controlled by the -B option to the git-diff-* commands.  This is
used to detect a filepair that represents a "complete rewrite" and
break such a filepair into two filepairs that represent delete and
create.  For example, if the input contained this filepair:

------------------------------------------------
:100644 100644 bcd1234... 0123456... M file0
------------------------------------------------

and if it detects that the file "file0" is completely rewritten,
it changes it to:

------------------------------------------------
:100644 000000 bcd1234... 0000000... D file0
:000000 100644 0000000... 0123456... A file0
------------------------------------------------

For the purpose of breaking a filepair, diffcore-break examines
the extent of changes between the contents of the files before
and after modification (i.e. the contents that have "bcd1234..."
and "0123456..." as their SHA1 content ID, in the above
example).  The amount of deletion of original contents and
the amount of insertion of new material are added together, and
if the sum exceeds the "break score", the filepair is broken into
two.  The break score defaults to 50% of the size of the smaller
of the original and the result (i.e. if the edit shrinks the
file, the size of the result is used; if the edit lengthens the
file, the size of the original is used), and can be customized by
giving a number after the "-B" option (e.g. "-B75" to tell it to
use 75%).

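
The break decision described above can be approximated as follows.
This sketch measures deletion and insertion in whole lines using
Python's `difflib`, which is an assumption made for illustration; it
is not the delta computation git uses, and `change_extent` and
`should_break` are hypothetical names.

```python
import difflib
from typing import List, Tuple


def change_extent(old_lines: List[str], new_lines: List[str]) -> Tuple[int, int]:
    """Return (deleted, inserted) line counts between two versions."""
    deleted = inserted = 0
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            deleted += i2 - i1
        if tag in ("insert", "replace"):
            inserted += j2 - j1
    return deleted, inserted


def should_break(old_lines: List[str], new_lines: List[str], break_percent: int = 50) -> bool:
    """Break when deletion plus insertion exceeds break_percent of the smaller side."""
    deleted, inserted = change_extent(old_lines, new_lines)
    base = min(len(old_lines), len(new_lines))
    return (deleted + inserted) * 100 > break_percent * base


original = [f"line {i}" for i in range(10)]
small_edit = original[:9] + ["changed"]
rewrite = [f"other {i}" for i in range(10)]
print(should_break(original, small_edit), should_break(original, rewrite))
```

With the default of 50%, the one-line edit is left alone while the
full rewrite is broken into a delete and a create.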

diffcore-rename
---------------

This transformation is used to detect renames and copies, and is
controlled by the -M option (to detect renames) and the -C option
(to detect copies as well) to the git-diff-* commands.  If the
input contained these filepairs:

------------------------------------------------
:100644 000000 0123456... 0000000... D fileX
:000000 100644 0000000... 0123456... A file0
------------------------------------------------

and the contents of the deleted file fileX are similar enough to
the contents of the created file file0, then rename detection
merges these filepairs and creates:

------------------------------------------------
:100644 100644 0123456... 0123456... R100 fileX file0
------------------------------------------------

When the "-C" option is used, the original contents of modified
files and contents of unchanged files (if they are fed to
diffcore; see the note at the end of this section) are considered
as candidate source files in the rename/copy operation, in
addition to the deleted files.  If the input were like these
filepairs, which talk about a modified file fileY and a newly
created file file0:

------------------------------------------------
:100644 100644 0123456... 1234567... M fileY
:000000 100644 0000000... 0123456... A file0
------------------------------------------------

the original contents of fileY and the resulting contents of
file0 are compared, and if they are similar enough, they are
changed to:

------------------------------------------------
:100644 100644 0123456... 1234567... M fileY
:100644 100644 0123456... 0123456... C100 fileY file0
------------------------------------------------

In both rename and copy detection, the same "extent of changes"
algorithm used in diffcore-break is used to determine if two
files are "similar enough", and can be customized to use a
similarity score different from the default 50% by giving a
number after the "-M" or "-C" option (e.g. "-M8" to tell it to use
8/10 = 80%).

Note.  When the "-C" option is used with the `\--find-copies-harder`
option, git-diff-\* commands feed unmodified filepairs to the
diffcore mechanism as well as modified ones.  This lets the copy
detector consider unmodified files as copy source candidates at
the expense of making it slower.  Without `\--find-copies-harder`,
git-diff-\* commands can detect copies only if the file that was
copied happened to have been modified in the same changeset.

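
The pairing of deleted and created files can be sketched as below.
The similarity measure here is `difflib.SequenceMatcher.ratio()`
over lines, chosen only because it is readily available; it is a
stand-in for, not a reproduction of, the "extent of changes"
algorithm.  `similarity_percent` and `detect_renames` are
hypothetical names, and copy detection is left out to keep the
sketch short.

```python
import difflib
from typing import Dict, List, Tuple


def similarity_percent(old_text: str, new_text: str) -> int:
    """Similarity of two contents as an integer percentage (stand-in measure)."""
    matcher = difflib.SequenceMatcher(
        a=old_text.splitlines(), b=new_text.splitlines(), autojunk=False
    )
    return int(matcher.ratio() * 100)


def detect_renames(
    deleted: Dict[str, str], created: Dict[str, str], min_percent: int = 50
) -> Tuple[List[Tuple[str, str, str]], Dict[str, str], Dict[str, str]]:
    """Pair each deleted path with its most similar created path.

    Returns (renames, remaining_deleted, remaining_created), where each
    rename is a (status, old_path, new_path) tuple such as ('R100', ...).
    """
    renames = []
    remaining_created = dict(created)
    remaining_deleted = {}
    for old_path, old_text in deleted.items():
        best = None
        for new_path, new_text in remaining_created.items():
            score = similarity_percent(old_text, new_text)
            if score >= min_percent and (best is None or score > best[0]):
                best = (score, new_path)
        if best is None:
            remaining_deleted[old_path] = old_text
        else:
            score, new_path = best
            renames.append((f"R{score}", old_path, new_path))
            del remaining_created[new_path]
    return renames, remaining_deleted, remaining_created


renames, left_deleted, left_created = detect_renames(
    {"fileX": "a\nb\nc\n"},
    {"file0": "a\nb\nc\n", "other": "x\ny\n"},
)
print(renames)  # [('R100', 'fileX', 'file0')]
```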

diffcore-merge-broken
---------------------

This transformation is used to merge filepairs that were broken
by diffcore-break, and were not transformed into rename/copy by
diffcore-rename, back into a single modification.  This always
runs when diffcore-break is used.

For the purpose of merging broken filepairs back, it uses a
different "extent of changes" computation from the ones used by
diffcore-break and diffcore-rename.  It counts only the deletion
from the original, and does not count insertion.  If you removed
only 10 lines from a 100-line document, even if you added 910
new lines to make a new 1000-line document, you did not do a
complete rewrite.  diffcore-break breaks such a case in order to
help diffcore-rename to consider such filepairs as candidates
for rename/copy detection, but if filepairs broken that way were
not matched with other filepairs to create rename/copy, then this
transformation merges them back into the original
"modification".

The "extent of changes" parameter can be tweaked from the
default 80% (that is, unless more than 80% of the original
material is deleted, the broken pairs are merged back into a
single modification) by giving a second number to the -B option,
like these:

* -B50/60 (give 50% "break score" to diffcore-break, use 60%
  for diffcore-merge-broken).

* -B/60 (the same as above, since diffcore-break defaults to 50%).

Note that an earlier implementation left a broken pair as
separate creation and deletion patches.  This was an unnecessary
hack, and the latest implementation always merges all the broken
pairs back into modifications, but the resulting patch output is
formatted differently to still make reviewing easier for such
a complete rewrite, by showing the entire contents of the old
version prefixed with '-', followed by the entire contents of the
new version prefixed with '+'.

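
The deletion-only measure and its 80% default can be worked through
with the 100-line example from this section.  The helper names
`deleted_percent` and `should_merge_back` are hypothetical, and
sizes are counted in lines purely for illustration.

```python
def deleted_percent(original_lines: int, deleted_lines: int) -> float:
    """Share of the original material that was deleted, in percent."""
    return deleted_lines * 100 / original_lines


def should_merge_back(original_lines: int, deleted_lines: int, merge_percent: int = 80) -> bool:
    """Merge a broken pair back unless more than merge_percent was deleted.

    Insertions are deliberately ignored, as described in the text.
    """
    return deleted_percent(original_lines, deleted_lines) <= merge_percent


# 10 lines removed from a 100-line document; the 910 added lines do not count.
print(deleted_percent(100, 10))        # 10.0
print(should_merge_back(100, 10))      # True: merged back into a modification
print(should_merge_back(100, 95))      # False: stays a complete rewrite
print(should_merge_back(100, 70, 60))  # False: the second number of -B/60 lowers the limit
```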

diffcore-pickaxe
----------------

This transformation is used to find filepairs that represent
changes that touch a specified string, and is controlled by the
-S option and the `\--pickaxe-all` option to the git-diff-*
commands.

When diffcore-pickaxe is in use, it checks if there are
filepairs whose "original" side has the specified string and
whose "result" side does not.  Such a filepair represents "the
string disappeared in this changeset".  It also checks for the
opposite case that gains the specified string.

When `\--pickaxe-all` is not in effect, diffcore-pickaxe leaves
only such filepairs that touch the specified string in its
output.  When `\--pickaxe-all` is used, diffcore-pickaxe leaves all
filepairs intact if there is such a filepair, or makes the
output empty otherwise.  The latter behaviour is designed to
make reviewing of the changes in the context of the whole
changeset easier.

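
The two output modes can be sketched in Python as follows.  A
filepair is modelled as a `(path, old_text, new_text)` tuple, and
"touches the string" is taken to mean that exactly one of the two
sides contains it; `touches` and `diffcore_pickaxe` are hypothetical
names used only for this illustration.

```python
from typing import List, Tuple

FilePair = Tuple[str, str, str]  # (path, old_text, new_text); a simplified stand-in


def touches(old_text: str, new_text: str, needle: str) -> bool:
    """True when the string is present on exactly one side of the pair."""
    return (needle in old_text) != (needle in new_text)


def diffcore_pickaxe(pairs: List[FilePair], needle: str, pickaxe_all: bool = False) -> List[FilePair]:
    """Keep only the touching pairs, or all pairs if any touch (pickaxe_all)."""
    hits = [pair for pair in pairs if touches(pair[1], pair[2], needle)]
    if pickaxe_all:
        return list(pairs) if hits else []
    return hits


pairs = [
    ("a.c", "int foo;\n", "int bar;\n"),     # loses "foo"
    ("b.c", "int x;\n", "int x, y;\n"),      # never mentions "foo"
    ("c.c", "foo();\n", "foo();\nbaz();\n"), # has "foo" on both sides
]
print([pair[0] for pair in diffcore_pickaxe(pairs, "foo")])
print([pair[0] for pair in diffcore_pickaxe(pairs, "foo", pickaxe_all=True)])
```

The first call keeps only `a.c`; the second keeps all three paths
because at least one filepair touches the string.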

diffcore-order
--------------

This is used to reorder the filepairs according to the user's
(or project's) taste, and is controlled by the -O option to the
git-diff-* commands.

This takes a text file, each line of which is a shell glob
pattern.  Filepairs that match a glob pattern on an earlier line
in the file are output before ones that match a later line, and
filepairs that do not match any glob pattern are output last.

As an example, a typical orderfile for the core GIT would
probably look like this:

------------------------------------------------
    README
    Makefile
    Documentation
    *.h
    *.c
    t
------------------------------------------------
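
The ordering rule can be sketched with Python's `fnmatch` module.
The sketch assumes that a pattern also matches a path when it
matches one of the path's leading directories (so that `t` selects
everything under `t/`); that reading, like the names `order_key` and
`diffcore_order`, is an assumption made for this illustration.

```python
import fnmatch
from typing import List


def order_key(path: str, patterns: List[str]) -> int:
    """Index of the first pattern matching the path or a leading directory."""
    parts = path.split("/")
    prefixes = ["/".join(parts[:n]) for n in range(1, len(parts) + 1)]
    for index, pattern in enumerate(patterns):
        if any(fnmatch.fnmatchcase(prefix, pattern) for prefix in prefixes):
            return index
    return len(patterns)  # unmatched paths sort last


def diffcore_order(paths: List[str], patterns: List[str]) -> List[str]:
    """Stable sort of paths by the position of their first matching pattern."""
    return sorted(paths, key=lambda path: order_key(path, patterns))


patterns = ["README", "Makefile", "Documentation", "*.h", "*.c", "t"]
paths = ["t/t0000.sh", "diff.c", "COPYING", "cache.h",
         "Documentation/diffcore.txt", "README"]
print(diffcore_order(paths, patterns))
```

Because the sort is stable, filepairs that match the same pattern
keep their original relative order.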