Checksum a billion files…

To gather an MD5 checksum of every file in a GPFS file system, apply a policy like:

RULE EXTERNAL LIST 'A' EXEC '<SCRIPT_LOCATION_NEEDED>' ESCAPE '%'
RULE 'any name' LIST 'A' WHERE NAME LIKE '%'

where <SCRIPT_LOCATION_NEEDED> points at the hash.ksh script below:

#! /usr/lpp/mmfs/bin/mmksh

outDir=$(dirname $0)
thisNode=$(hostname -s)
thisPID=$$

outDir=${outDir}/${thisNode}

suffix=$(date '+%Y-%m-%d_%H:%M:%S:%N')

if [[ -x /bin/md5sum ]]
then
  md5sum=/bin/md5sum
elif [[ -x /usr/bin/md5sum ]]
then
  md5sum=/usr/bin/md5sum
else
  echo "Unable to find md5sum executable, exiting."
  exit 1
fi

stdout=${outDir}/stdout.${thisPID}.${suffix}
stderr=${outDir}/stderr.${thisPID}.${suffix}
md5out=${outDir}/md5out.${thisPID}.${suffix}

#
# Check if the output directory exists and if not try to create it.
#
if [[ ! -d $outDir ]]
then
  # Try to create the directory where files will be placed
  mkdir -p "$outDir"
  rc=$?
  if [[ $rc != 0 ]]
  then
    echo "Failed to create output directory, ${outDir}, error ${rc}" >&2
    exit 1
  fi
fi

exec 1>> "$stdout"
exec 2>> "$stderr"

echo "$(date '+%Y-%m-%d_%H:%M:%S') Process: ${thisPID} Starting"


if [[ "$1" == "LIST" ]]
then
  #
  # The file provided by mmapplypolicy has the following format.
  #
  # InodeNumber GenNumber SnapId OptionalShowArgs -- FullPathToFile
  #
  # We are only interested in the filename (FullPathToFile) argument.  Also, we are not using the
  # SHOW directive of the policy function so the OptionalShowArgs will be empty.  So each line will
  # have the following format for this script.
  #
  # InodeNumber GenNumber SnapId -- FullPathToFile
  #

  # Everything after the 4th field is the filename.
  # Note: with RFC3986 encoding, the file name should be in 1 token.
  #
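  # An illustrative, made-up example line (path is percent-encoded):
  #
  #   12345 65538 0 -- /gpfs/gpfs1/some%20dir/file.txt
  #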
  while read -r token1 token2 token3 token4 theRest
  do
    ${md5sum} "$(/usr/lpp/mmfs/bin/mmcmi percentdecode $theRest)" >> ${md5out}
    rc=$?
    if [[ $rc != 0 ]]
    then
      echo "${md5sum} failed, error ${rc}, file $(/usr/lpp/mmfs/bin/mmcmi percentdecode $theRest)" >&2
    fi
  done < $2
fi

echo "$(date '+%Y-%m-%d_%H:%M:%S') Process: ${thisPID} Completed"

Run the policy like below, pointing at some other file system (-g) to hold temporary files during processing, and running batches of 2000 files (-B):

# mmapplypolicy /gpfs/gpfs1 -P /gpfs/janfrodefs/scale-checksum/hash.policy  -N s6k_x86_64 -g /gpfs/janfrodefs/tmp -B 2000

This will create one directory per node running the policy, each containing one md5out* file for each batch of files.
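
While the policy is running, a rough progress check is to count the md5out files and the lines in them, from the directory holding the per-node output directories (the same place the pipeline below is run from):

# find . -name "md5out*" | wc -l
# find . -name "md5out*" -exec cat '{}' + | wc -l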

To compare checksums between the old and new file systems, we can put all checksums into a database:

# cat populate-db.awk
BEGIN {
        printf "CREATE TABLE IF NOT EXISTS checksums(file text NOT NULL PRIMARY KEY, checksum text NOT NULL);\n";
        printf ".output /dev/null\n";
        printf "PRAGMA busy_timeout=20000;\n";
        printf "PRAGMA journal_mode = OFF;\n";
        printf "PRAGMA synchronous = OFF;\n";
        printf "PRAGMA locking_mode = EXCLUSIVE;\n";
        printf "PRAGMA temp_store = MEMORY;\n";
        printf ".output stdout\n";

        # Sanitize output for sqlite, just drop single quotes..
        # gsub("'", "", $2)
}
{
        # One SQL transaction per 1000000 lines.
        if (NR == 1)
                printf "BEGIN IMMEDIATE TRANSACTION;\n";
        else if (NR % 1000000 == 0)
        {
                printf "END TRANSACTION;\n";
                printf "BEGIN IMMEDIATE TRANSACTION;\n";
        }

        printf "INSERT OR IGNORE INTO checksums(file, checksum) VALUES('%s', '%s');\n", $2, $1;

}
END {
        printf "END TRANSACTION;\n";
}

# find . -name "md5out*" -exec sed 's/  /JANFRODESEPARATOR/' '{}' + | jq "@uri" -Rr | sed "s/'/%27/g" | awk -F 'JANFRODESEPARATOR' -f populate-db.awk | sqlite3 checksums.db

Here we used “jq” to URL-encode the file names, to avoid issues with special characters.
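
As a quick sanity check before loading a full run, the same pipeline can be fed a single made-up line (checksum and path below are invented), leaving out the final sqlite3 stage so the generated SQL can be inspected:

# printf '%s  %s\n' 0123456789abcdef0123456789abcdef "/gpfs/gpfs1/dir/it's a file" | sed 's/  /JANFRODESEPARATOR/' | jq "@uri" -Rr | sed "s/'/%27/g" | awk -F 'JANFRODESEPARATOR' -f populate-db.awk

Note that jq percent-encodes the whole path, slashes included. That is fine for our purpose: both the source and target databases are loaded through the same encoding, so the file names still match up when comparing.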

Once we have our databases generated, we can poke at them using sqlite. For example, to find how many identical checksums we have:

sqlite> select checksum, count(*) from checksums group by checksum having count(*) > 1 order by count(*);
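
Another quick sanity check is that the row count matches the number of files reported by the mmapplypolicy run:

sqlite> select count(*) from checksums;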

Compare two different checksum databases:

sqlite> attach "sourceSystem.db" as sourcedb;
sqlite> attach "checksums.db" as targetdb;
sqlite> select a.file from sourcedb.checksums a inner join targetdb.checksums b on a.file = b.file where a.checksum <> b.checksum;
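
A related check, using the same two attached databases, is listing files that exist in the source but are missing from the target entirely:

sqlite> select a.file from sourcedb.checksums a left join targetdb.checksums b on a.file = b.file where b.file is null;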

Optimizations…

If we have a node with lots of free memory, we can speed up the SQL processing quite a bit. The “populate-db.awk” below uses a temporary in-memory database before populating the final one:

# cat populate-db.awk
BEGIN {
        printf "ATTACH DATABASE 'file::memory:' AS aux1;\n";
        printf "CREATE TABLE IF NOT EXISTS aux1.checksums(file text NOT NULL PRIMARY KEY, checksum text NOT NULL);\n";
        printf ".output /dev/null\n";
        printf "PRAGMA busy_timeout=20000;\n";
        printf "PRAGMA journal_mode = OFF;\n";
        printf "PRAGMA synchronous = OFF;\n";
        printf "PRAGMA locking_mode = EXCLUSIVE;\n";
        printf "PRAGMA temp_store = MEMORY;\n";
        printf "PRAGMA cache_size = -400000000;\n";
        printf ".output stdout\n";

}
{
        printf "INSERT OR IGNORE INTO aux1.checksums(file, checksum) VALUES('%s', '%s');\n", $2, $1;
}
END {
        printf "CREATE TABLE IF NOT EXISTS checksums(file text NOT NULL PRIMARY KEY, checksum text NOT NULL);\n";
        printf "INSERT INTO checksums SELECT * FROM aux1.checksums;"
}
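
Note that a negative cache_size is interpreted by sqlite as a size in KiB, so -400000000 asks for roughly 400 GB of page cache; adjust it to the memory actually available. The loading pipeline itself is unchanged, only the awk script differs:

# find . -name "md5out*" -exec sed 's/  /JANFRODESEPARATOR/' '{}' + | jq "@uri" -Rr | sed "s/'/%27/g" | awk -F 'JANFRODESEPARATOR' -f populate-db.awk | sqlite3 checksums.db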

Similarly, during comparison we can configure sqlite to use larger amounts of memory to speed things up. For example:

# cat count-unique-checksums.sql
PRAGMA busy_timeout=20000;
PRAGMA journal_mode = OFF;
PRAGMA synchronous = OFF;
PRAGMA locking_mode = EXCLUSIVE;
PRAGMA temp_store = MEMORY;
PRAGMA cache_size = -200000000;

select checksum, count(*) from checksums group by checksum having count(*) > 1 order by count(*);
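
Run it against the database generated earlier:

# sqlite3 checksums.db < count-unique-checksums.sql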