When you recursively scan a massive directory tree, would you use Sys.readdir or Unix.readdir? My assumption was that Sys.readdir is the more convenient of the two, so the lower-level Unix.readdir would have the performance edge. Is the difference significant enough to bother with?
I quickly coded up both options for comparison. Here’s the Unix.readdir version, which calls Unix.opendir and then calls Unix.readdir repeatedly until the End_of_file exception is raised.
(* Count regular files under [path]; [x] is the running count. *)
let rec traverse_directory_unix path x =
  let stats = Unix.lstat path in
  match stats.st_kind with
  | Unix.S_REG -> x + 1
  | S_LNK | S_CHR | S_BLK | S_FIFO | S_SOCK -> x
  | S_DIR ->
      try
        let dir_handle = Unix.opendir path in
        let rec read_entries acc =
          try
            match Unix.readdir dir_handle with
            | "." | ".." -> read_entries acc
            | entry ->
                let full_path = Filename.concat path entry in
                read_entries (traverse_directory_unix full_path acc)
          with End_of_file ->
            (* No more entries in this directory. *)
            Unix.closedir dir_handle;
            acc
        in
        read_entries x
      (* On any other exception, give up on this directory and return the
         count as it was on entry. *)
      with _ -> x
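For reference, here is the open, read until End_of_file, close pattern on its own, for a single directory. This is only a sketch and not the code that was benchmarked: list_entries is a hypothetical helper name, and it assumes OCaml 4.08 or later for Fun.protect, which closes the handle even if an exception escapes the loop.

```ocaml
(* Requires the unix library. *)

(* list_entries is a hypothetical helper, not part of the benchmark.
   It returns the names in [path], excluding "." and "..", and closes
   the directory handle whether or not an exception is raised. *)
let list_entries path =
  let dir_handle = Unix.opendir path in
  Fun.protect
    ~finally:(fun () -> Unix.closedir dir_handle)
    (fun () ->
      let rec loop acc =
        (* Handling End_of_file as a match case keeps [loop] tail-recursive. *)
        match Unix.readdir dir_handle with
        | "." | ".." -> loop acc
        | entry -> loop (entry :: acc)
        | exception End_of_file -> List.rev acc
      in
      loop [])
```

The entries come back in whatever order the file system returns them, so sort the list if you need a stable order.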
The Sys.readdir version conveniently gives us an array, so we can idiomatically use Array.fold_left.
(* Count regular files under [source]. *)
let traverse_directory_sys source =
  let rec process_directory s current_source =
    (* Sys.readdir already omits "." and "..". A Sys_error raised here
       is not caught by the handler below. *)
    let entries = Sys.readdir current_source in
    Array.fold_left
      (fun acc entry ->
        let source = Filename.concat current_source entry in
        try
          let stat = Unix.lstat source in
          match stat.st_kind with
          | Unix.S_REG -> acc + 1
          | Unix.S_DIR -> process_directory acc source
          | S_LNK | S_CHR | S_BLK | S_FIFO | S_SOCK -> acc
        with Unix.Unix_error _ -> acc)
      s entries
  in
  process_directory 0 source
The file system may have a big impact, so I tested NTFS, ReFS, and ext4, running each a couple of times to ensure the cache was primed.
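A minimal sketch of how such a measurement could be taken, with one discarded warm-up run to prime the cache followed by a few timed runs. This is an assumption about the method, not the harness actually used for these numbers: time_runs is a hypothetical helper, and it measures wall-clock time with Unix.gettimeofday.

```ocaml
(* Requires the unix library. *)

(* time_runs is a hypothetical helper, not from the original tests.
   It calls [f] once as a warm-up, then [runs] more times, and returns
   the last result together with the fastest wall-clock time in seconds. *)
let time_runs ?(runs = 3) f =
  let timed () =
    let t0 = Unix.gettimeofday () in
    let result = f () in
    (result, Unix.gettimeofday () -. t0)
  in
  (* Warm-up run: keep the result, discard the timing. *)
  let warm = fst (timed ()) in
  let rec loop n best result =
    if n = 0 then (result, best)
    else
      let r, dt = timed () in
      loop (n - 1) (Float.min best dt) r
  in
  loop runs infinity warm
```

Each traversal can then be wrapped in a closure, for example time_runs (fun () -> traverse_directory_sys root), where root is whichever directory is under test.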
Sys.readdir was quicker in my test cases up to 500,000 files. At 750,000 files, Unix.readdir edged ahead. I was surprised by the outcome and wondered whether the difference came from my code rather than from the module I used.
Pushing for the result I expected/wanted, I rewrote the function so it more closely mirrors the Sys.readdir version.
(* Count regular files under [path], calling lstat on each entry inside
   the read loop, as the Sys.readdir version does. *)
let traverse_directory_unix_2 path =
  let rec process_directory s path =
    try
      let dir_handle = Unix.opendir path in
      let rec read_entries acc =
        try
          let entry = Unix.readdir dir_handle in
          match entry with
          | "." | ".." -> read_entries acc
          | entry ->
              let full_path = Filename.concat path entry in
              let stats = Unix.lstat full_path in
              match stats.st_kind with
              | Unix.S_REG -> read_entries (acc + 1)
              | S_LNK | S_CHR | S_BLK | S_FIFO | S_SOCK -> read_entries acc
              | S_DIR -> read_entries (process_directory acc full_path)
        with End_of_file ->
          (* No more entries in this directory. *)
          Unix.closedir dir_handle;
          acc
      in
      read_entries s
    (* On any other exception, give up on this directory and return the
       count as it was on entry; the handle is not closed on this path. *)
    with _ -> s
  in
  process_directory 0 path
This version is indeed faster than Sys.readdir in all cases. However, at 750,000 files the speed-up was less than 0.5%.