18 Future Projects

Here are some ideas for improving GNU diff and patch. The GNU project has identified some improvements as potential programming projects for volunteers. You can also help by reporting any bugs that you find.

If you are a programmer and would like to contribute something to the GNU project, please consider volunteering for one of these projects. If you are seriously contemplating work, please write to the bug report mailing list (see Reporting Bugs) to coordinate with other volunteers.


18.1 Suggested Projects for Improving GNU diff and patch

One should be able to use GNU diff to generate a patch from any pair of directory trees, and given the patch and a copy of one such tree, use patch to generate a faithful copy of the other. Unfortunately, some changes to directory trees cannot be expressed using current patch formats; also, patch does not handle some of the existing formats. These shortcomings motivate the following suggested projects.


18.1.1 Handling Multi-byte and Varying-Width Characters

diff, diff3 and sdiff treat each line of input as a string of characters and encoding errors, where an encoding error is an input byte that is not part of any character. Single-byte and multi-byte characters are supported, along with common character encoding systems like UTF-8. The operating system’s locale specifies the character encoding, and can be specified with the LC_ALL environment variable. You can find which locales are supported on your system by running the shell command ‘locale -a’.

When counting columns for options like --expand-tabs (-t), diff consults the locale for the column width of each character, and assumes that each encoding error occupies a single column.

When ignoring case for --ignore-case (-i), diff downcases each character before comparing it, regardless of whether it is multi-byte. See Suppressing Case Differences.
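As a rough illustration of how the locale is selected for a single run, the following sketch sets LC_ALL only for the diff invocation and compares two files that differ only in case. The file names and the temporary directory are made up for the example, and the C locale is used only because it is present on every system; for multi-byte text you would substitute a locale reported by ‘locale -a’.

```shell
# Scratch files that differ only in letter case (hypothetical names).
tmp=$(mktemp -d)
printf 'Hello World\n' > "$tmp/a"
printf 'hello world\n' > "$tmp/b"

# With --ignore-case the lines compare equal, so diff exits with status 0.
status_ignore_case=0
LC_ALL=C diff -i "$tmp/a" "$tmp/b" || status_ignore_case=$?

# Without it the same files differ, so diff exits with status 1.
status_plain=0
LC_ALL=C diff "$tmp/a" "$tmp/b" > /dev/null || status_plain=$?

echo "with -i: $status_ignore_case, without: $status_plain"
```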


18.1.2 Handling Changes to the Directory Structure

diff and patch do not handle some changes to directory structure. For example, suppose one directory tree contains a directory named ‘D’ with some subsidiary files, and another contains a file with the same name ‘D’. ‘diff -r’ does not output enough information for patch to transform the directory subtree into the file.
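The limitation can be reproduced with a short experiment. This is only a sketch: the tree names ‘old’ and ‘new’ and the file ‘notes.txt’ are invented for the example, and the exact wording of the report may vary between versions.

```shell
# Build two scratch trees: 'D' is a directory in one and a file in the other.
tmp=$(mktemp -d)
mkdir -p "$tmp/old/D" "$tmp/new"
echo 'subsidiary file' > "$tmp/old/D/notes.txt"
echo 'now a regular file' > "$tmp/new/D"

# Capture the recursive comparison and its exit status.
status=0
report=$(cd "$tmp" && diff -r old new) || status=$?
printf '%s\n' "$report"
```

diff notes that the two entries named ‘D’ have different types and exits with status 1, but neither the contents of the directory nor those of the file appear in the output, so patch has nothing from which to perform the conversion.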

There should be a way to specify that a file has been removed without having to include its entire contents in the patch file. There should also be a way to tell patch that a file was renamed, even if there is no way for diff to generate such information. There should be a way to tell patch that a file’s timestamp has changed, even if its contents have not changed.

These problems can be fixed by extending the diff output format to represent changes in directory structure, and extending patch to understand these extensions.


18.1.3 Files that are Neither Directories Nor Regular Files

Some files are neither directories nor regular files: they are unusual files like symbolic links, device special files, named pipes, and sockets. Currently, diff treats symbolic links as if they were the pointed-to files, except that a recursive diff reports an error if it detects infinite loops of symbolic links (e.g., symbolic links to ..). diff treats other special files like regular files if they are specified at the top level, but simply reports their presence when comparing directories. This means that patch cannot represent changes to such files. For example, if you change which file a symbolic link points to, diff outputs the difference between the two files, instead of the change to the symbolic link.
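The symbolic link case can be seen in a small experiment. The sketch below uses invented names (‘target1’, ‘target2’, ‘link’) and assumes that the readlink utility is available to show where each link points.

```shell
# Two trees whose only entry is a symbolic link named 'link'.
tmp=$(mktemp -d)
mkdir "$tmp/old" "$tmp/new"
echo 'same content' > "$tmp/target1"
echo 'same content' > "$tmp/target2"
ln -s ../target1 "$tmp/old/link"
ln -s ../target2 "$tmp/new/link"

# diff follows the links and compares the pointed-to files.
status=0
diff -r "$tmp/old" "$tmp/new" || status=$?

# The links themselves name different targets.
old_target=$(readlink "$tmp/old/link")
new_target=$(readlink "$tmp/new/link")
echo "diff status: $status; old -> $old_target; new -> $new_target"
```

Because the pointed-to files have identical contents, diff prints nothing and exits with status 0, even though the link was retargeted; a patch generated this way cannot carry that change.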

diff should optionally report changes to special files specially, and patch should be extended to understand these extensions.


18.1.4 File Names that Contain Unusual Characters

Since diffutils-3.3, file names have been encoded to eliminate the ambiguity of unusual characters like newline and TAB. However, because a file name containing a newline can easily cause trouble in many contexts (not just diff and patch), a future version of diff may well reject attempts to operate on such names.


18.1.5 Outputting Diffs in Timestamp Order

Applying patch to a multiple-file diff can result in files whose timestamps are out of order. GNU patch has options to restore the timestamps of the updated files (see Updating Timestamps on Patched Files), but sometimes it is useful to generate a patch that works even if the recipient does not have GNU patch, or does not use these options. One way to do this would be to implement a diff option to output diffs in timestamp order.
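Until such an option exists, the effect can be approximated for simple cases with a shell loop. The following is a minimal sketch under several assumptions: the trees ‘old’ and ‘new’ and their files are invented, both trees are flat and contain the same file names, and no file name contains whitespace (the loop relies on word splitting of the ‘ls -tr’ output).

```shell
# Scratch trees with two changed files whose timestamps are set explicitly.
tmp=$(mktemp -d)
mkdir "$tmp/old" "$tmp/new"
echo 'version 1' > "$tmp/old/a.c"; echo 'version 2' > "$tmp/new/a.c"
echo 'version 1' > "$tmp/old/b.c"; echo 'version 2' > "$tmp/new/b.c"
touch -t 202001010000 "$tmp/new/b.c"   # b.c was modified first
touch -t 202101010000 "$tmp/new/a.c"   # a.c was modified later

# Emit one diff per file, oldest modification first.
# diff exits with status 1 when files differ, which is expected here.
patch_text=$(
  cd "$tmp" &&
  for f in $(cd new && ls -tr); do
    diff -u "old/$f" "new/$f" || [ $? -eq 1 ]
  done
)
printf '%s\n' "$patch_text"
```

A recipient who applies the resulting patch from top to bottom updates ‘b.c’ before ‘a.c’, so the relative order of the new timestamps matches the order in the sender's tree.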


18.1.6 Ignoring Certain Changes

It would be nice to have a feature for specifying two strings, one in from-file and one in to-file, which should be considered to match. Thus, if the two strings are ‘foo’ and ‘bar’, then if two lines differ only in that ‘foo’ in file 1 corresponds to ‘bar’ in file 2, the lines are treated as identical.

It is not clear how general this feature can or should be, or what syntax should be used for it.

A partial substitute is to filter one or both files before comparing, e.g.:

sed 's/foo/bar/g' file1 | diff - file2

However, this outputs the filtered text, not the original.
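The drawback shows up as soon as a line containing the substituted string also differs in some other way. In the following sketch, the contents of ‘file1’ and ‘file2’ are invented for the example.

```shell
# file1 uses 'foo' where file2 uses 'bar'; the second line also differs otherwise.
tmp=$(mktemp -d)
printf 'x = foo\ny = foo + 1\n' > "$tmp/file1"
printf 'x = bar\ny = bar + 2\n' > "$tmp/file2"

# Filter file1 before comparing, as in the command above.
status=0
out=$(sed 's/foo/bar/g' "$tmp/file1" | diff - "$tmp/file2") || status=$?
printf '%s\n' "$out"
```

The first line is treated as matching, as intended, but the line reported for the first file reads ‘y = bar + 1’, although ‘file1’ actually contains ‘y = foo + 1’. A patch produced this way would therefore not apply to the unfiltered file.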


18.1.7 Improving Performance

When comparing two large directory structures, one of which was originally copied from the other with timestamps preserved (e.g., with ‘cp -pR’), it would greatly improve performance if an option told diff to assume that two files with the same size and timestamps have the same content. See diff Performance Tradeoffs.
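Until such an option exists, a wrapper script can approximate the idea. The sketch below is an illustration under stated assumptions, not a proposal for how diff should implement it: the helper names ‘signature’, ‘probably_same’ and ‘check’ are invented, and ‘stat -c’ is the GNU coreutils syntax, which other systems may spell differently.

```shell
# Print "size mtime" for a file (GNU coreutils stat syntax, an assumption).
signature() { stat -c '%s %Y' "$1"; }

# Hypothetical shortcut: succeed when size and timestamp both match.
probably_same() { [ "$(signature "$1")" = "$(signature "$2")" ]; }

# Read file contents only when the shortcut does not apply.
check() {
  if probably_same "$1" "$2"; then
    echo 'assumed identical'
  elif cmp -s "$1" "$2"; then
    echo 'identical'
  else
    echo 'different'
  fi
}

tmp=$(mktemp -d)
echo 'unchanged' > "$tmp/orig"
cp -p "$tmp/orig" "$tmp/copy"          # cp -p preserves the timestamp
echo 'changed!!' > "$tmp/edited"       # same size, different content
touch -t 202001010000 "$tmp/edited"    # and a different timestamp

copy_result=$(check "$tmp/orig" "$tmp/copy")
edited_result=$(check "$tmp/orig" "$tmp/edited")
echo "copy: $copy_result; edited: $edited_result"
```

The shortcut is only a heuristic: a file that was modified without changing its size and then had its timestamp restored would be wrongly assumed identical, which is why such behavior would need to be requested explicitly.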


18.2 Reporting Bugs

If you think you have found a bug in GNU cmp, diff, diff3, or sdiff, please report it by electronic mail to the GNU utilities bug report mailing list. Bug reports for GNU patch go to a separate address. Send as precise a description of the problem as you can, including the output of the --version option and sample input files that produce the bug, if applicable. If you have a nontrivial fix for the bug, please send it as well, ideally as a patch. It may simplify the maintainer’s job if the patch is relative to a recent test release, which you can find in the directory ftp://alpha.gnu.org/gnu/diffutils/.