macOS python optimization

samwich · Nov 4, 2013

Hey there, looking for some ideas to optimize this script. It recursively searches in the specified directory and prints out the most recent accessed files and most recent modified files. Please _don't_ tell me to write it in C or use pypy or something, I'd like to try to keep this in plain python.

Code:

import os
import stat
import datetime as dt
import argparse
from pprint import pprint
 
 
def print_files(num_files, directory):
    """
    gets a list of files sorted by modified time
 
    keyword args:
    num_files -- the n number of files you want to print
    directory -- the starting root directory of the search
 
    """
    modified = []
    accessed = []
    rootdir = os.path.join(os.getcwd(), directory)
 
    for root, sub_folders, files in os.walk(rootdir):
        for file in files:
            try:
                unix_modified_time = os.stat(os.path.join(root, file))[stat.ST_MTIME]
                unix_accessed_time = os.stat(os.path.join(root, file))[stat.ST_ATIME]
                human_modified_time = dt.datetime.fromtimestamp(unix_modified_time).strftime('%Y-%m-%d %H:%M:%S')
                human_accessed_time = dt.datetime.fromtimestamp(unix_accessed_time).strftime('%Y-%m-%d %H:%M:%S')
                filename = os.path.join(root, file)
                modified.append((human_modified_time, filename))
                accessed.append((human_accessed_time, filename))
            except:
                pass
 
    modified.sort(key=lambda a: a[0], reverse=True)
    accessed.sort(key=lambda a: a[0], reverse=True)
    print('Modified')
    pprint(modified[:num_files])
    print('Accessed')
    pprint(accessed[:num_files])
 
 
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('-n',
                        '--number',
                        help='number of items to return',
                        type=int,
                        default=1)
    parser.add_argument('-d',
                        '--directory',
                        help='specify a directory to search in',
                        default='./')
 
    args = parser.parse_args()
 
    print_files(args.number, args.directory)
 
if __name__ == '__main__':
    main()

mrichmon · Nov 5, 2013

You don't state what you are looking to optimize for. I assume you mean execution time, but many times people need to optimize for things other than execution time. (eg, memory usage)

In general for time-optimization, you should consider profiling your code to determine where the bulk of the time is spent. You can easily profile by introducing time calculations and then gradually narrowing down to the main time costs.

Based on my experience, your primary time costs are likely due to:

sorting your lists
the memory allocation while building the lists

But since you are only interested in the most recent time, why are you building the list and sorting them at all? How about an approach that just uses two variables:

most recent modification time found so far (most_recent_time)
modification time of current file (current_time)

If the most_recent_time is None, then most_recent_time = current time.
Else if current_time > most_recent_time, then current_time = most_recent_time and most_recent_file = current_filename.

Also, there is no reason to convert a time to human-readable format until you've found the time that you want to report. Just by moving the time conversion out of the loop and eliminating the conversion until immediately before the print statement will significantly reduce the number of computations required.

chown33 · Nov 5, 2013

mrichmon said:
If the most_recent_time is None, then most_recent_time = current time.
Else if current_time > most_recent_time, then current_time = most_recent_time and most_recent_file = current_filename.

A common optimization is possible here. Instead of a special "None" initial value, initialize most_recent_time with an "impossible" value that compares less than any possible value that will be encountered during the scan. For Unix-times, a common starting value is 0, which represents the Unix epoch time of midnight 01 Jan 1970 GMT.

The result is that a separate test for "None" is no longer necessary, so the conditional in the loop simplifies to:

Code:

If current_time > most_recent_time, then current_time = most_recent_time and most_recent_file = current_filename.

I don't recall the name of this technique, but it's kind of the opposite of a sentinel value:
http://en.wikipedia.org/wiki/Sentinel_value

mrichmon · Nov 5, 2013

chown33 said:
A common optimization is possible here. Instead of a special "None" initial value, initialize most_recent_time with an "impossible" value that compares less than any possible value that will be encountered during the scan. For Unix-times, a common starting value is 0, which represents the Unix epoch time of midnight 01 Jan 1970 GMT.[/url]

Fair enough. Since python ints are unsigned it can be easily initialized to an invalid time value of "-1". For the same effect.

In my text I was aiming for clarity since the original poster has implemented code where a whole lot of unnecessary work is performed repeatedly in the loop. But I am certainly making many assumptions here.

Also, the original code is printing the n most recently modified files. So a single variable approach will not solve the stated problem. But, an extension to the single variable approach that only maintains a list of at most n timestamps is possible at the cost of a significant increase in the amount of code needed.

subsonix · Nov 5, 2013

mrichmon said:
Just by moving the time conversion out of the loop and eliminating the conversion until immediately before the print statement will significantly reduce the number of computations required.

Why? It seems to be the same amount of computations, you have just postponed them to a later time.

mrichmon · Nov 5, 2013

subsonix said:
Why? It seems to be the same amount of computations, you have just postponed them to a later time.

Multiple reasons:

The code as posted builds up a list of times and then outputs only a portion of that list in the case where the number of files in the folders is larger than num_files. So, for all the files that are not output to the user a relatively expensive conversion from an integer to a DateTime object is performed but does not contribute to the program output.
Comparison of complex data types such as the DateTime objects being compared in the sort operations is generally more expensive than comparison of integers. One of the reasons why Unix epoc times are recorded as integers is to lower the cost of time comparisons.

By postponing the operations you can reduce the overall number that are necessary. For example, assume 1000 files in the subfolders where n is 10. In the original posted code there are 2000 calls to instantiate a DateTime object. If the lists are built using the integer time value and the DateTime objects are instantiated after the sort operations during the output processing, then only the 20 output time values need to be instantiated as DateTime objects. That's a reduction of 1980 object instantiation operations in this example.

subsonix · Nov 5, 2013

mrichmon said:
Multiple reasons:

The code as posted builds up a list of times and then outputs only a portion of that list in the case where the number of files in the folders is larger than num_files. So, for all the files that are not output to the user a relatively expensive conversion from an integer to a DateTime object is performed but does not contribute to the program output.

Right, I missed that num_files was an argument to the print function.

mrichmon said:
Comparison of complex data types such as the DateTime objects being compared in the sort operations is generally more expensive than comparison of integers. One of the reasons why Unix epoc times are recorded as integers is to lower the cost of time comparisons.

Yes but that is not a direct function of moving the conversions out of the loop but storing the dates as unix time stamps, which I agree with.

samwich · Nov 6, 2013

Thanks! Yes, I was wanting to optimize for execution time. Good catch - I will definitely keep instantiate the datetime objects outside the loop.

I'll also look into maintaing a compact array instead of one large array

Search

Search

macOS python optimization

samwich

macrumors regular

mrichmon

macrumors 6502a

chown33

Moderator

mrichmon

macrumors 6502a

subsonix

macrumors 68040

mrichmon

macrumors 6502a

subsonix

macrumors 68040

samwich

macrumors regular

Our Staff