2021-03-27 14:17:16 (edited by Zarvox 2021-03-27 14:19:50)

Below is the function I use to collect a list of files from a starting directory. So far I have only tested this function in a Discord bot application. There are both a regular and an asynchronous version, and they both have the same result: the bot times out and stops working, and even when it doesn't time out, it takes over a minute to process.

import os
from pathlib import Path
import stringutil  # my custom module that defines trim(); code posted further down

async def search_system(starting_path, include_root=True, formats=["wav", "flac", "mp3", "ogg", "m4a", "wma"]):
    supported_files = []
    for root, dirs, files in os.walk(Path(starting_path)):
        for file in files:
            # checks the file's extension
            if len(formats) > 0:
                ext = stringutil.trim(f"{file}", ".", "after", "last")
            if len(formats) == 0 or ext in formats:
                if include_root:
                    file = os.path.join(root, file)
                supported_files.append(file)
    return supported_files

If the trim function could be the reason it is slow, I can provide that too. However, I don't think that is the case.
For instance, it will work on a folder that has about 15 subfolders and an average depth of 3 sublayers, in roughly 15 to 30 seconds, definitely under a minute. But if I move that directory down one more layer, that's when it starts having problems. I remember testing this function before putting it in the Discord bot, and that folder was actually processed, with a result of about 19000 files.
Does anyone have any tips for making this faster? I'm pretty sure the bot crashes because of a Discord timeout. But as for speed, what can I do? I know Python is a slow language, but mine takes over a minute even without being asynchronous, while my friend's JS bot can process that folder plus more in 15 seconds lol.

2021-03-27 17:31:01

Python can handle this case, and much more, just fine.  This is another place where you shouldn't blame Python.  A few thousand files should iterate in under a second, and the depth of the directory tree shouldn't matter.  At least on an SSD, at any rate.

First, the solution to all your woes is the glob module.  You can grab everything with it, then iterate over the list.  You might have to use os.path to process the results, assuming you ever actually don't want the root, but that's not a big deal.

Second, if you're not going to use the glob module, you should probably match using a pre-compiled regex.  I'm not sure what your trim function is doing, but I'm pretty sure that's where the problem is going to be.  I'm guessing that you either do a whole lot of string splitting by hand, or you make a regex every time through it.  Either is pretty bad.  Just looking at the arguments tells me that it's either a very weird function or a very complicated function that does more than it appears it should.

As I recall the regex you need is:

\.(ext1|ext2|ext3|ext4)$

Which you want to use re.compile on up front.  That might be slightly wrong.  To build it:

import re

ext_joined = "|".join(formats)
pattern = r"\.({})$".format(ext_joined)
regex = re.compile(pattern)

Then you use the regex object for all your matching.  Be careful to guard against the case of no formats; that will produce an invalid pattern (you must avoid compiling it at all; I'm reasonably sure the result is syntactically invalid regex).
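Putting those pieces together, a minimal sketch (the helper name, re.escape, and the IGNORECASE flag are my additions, not from the post above):

```python
import re

def build_ext_regex(formats):
    # Guard against an empty formats list up front rather than
    # compiling a degenerate pattern.
    if not formats:
        return None
    # re.escape keeps any odd characters in an extension literal.
    pattern = r"\.({})$".format("|".join(re.escape(f) for f in formats))
    return re.compile(pattern, re.IGNORECASE)

regex = build_ext_regex(["wav", "flac", "mp3"])
print(bool(regex.search("song.flac")))   # True
print(bool(regex.search("notes.txt")))   # False
```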

This:

if len(formats)==0 or ext in formats:

Can be:

if ext in formats:

Since nothing is ever in the empty list.  That won't shave off a lot of time, but it's cleaner code.  Your next response is going to be "but ext isn't defined": that's fine, just do ext = "" before the optional trim call and it'll always have a value.

This shouldn't be an async function, because everything in it is synchronous.  Putting the word async before a function doesn't make it cooperate with the event loop unless you also await inside it.  In general, file operations can't be made async because of OS limitations, so you either have to eat the cost of blocking the main event loop for small amounts of time and do periodic async sleeps, or move the work to a thread pool.  I haven't done enough with Python's async to know exactly what to point you at, but there are "please run this in a thread pool and give me a future" functions in the standard library that will do what you want, and this particular case is literally what they were invented for.
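As a sketch of the thread-pool route: asyncio.to_thread (added in Python 3.9; older versions can use loop.run_in_executor) runs a blocking function in the default thread pool and hands back an awaitable. The function names here are illustrative:

```python
import asyncio
import os

def walk_files(starting_path):
    # Plain synchronous walk; this is the blocking part.
    found = []
    for root, dirs, files in os.walk(starting_path):
        for name in files:
            found.append(os.path.join(root, name))
    return found

async def search_system(starting_path):
    # The event loop stays responsive while a worker thread walks the disk.
    return await asyncio.to_thread(walk_files, starting_path)

# In the bot: supported_files = await search_system("/path/to/music")
```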

I would be curious as to your timing methodology.  There is a module called timeit.  You do:

import timeit

def test():
    # some code you want to time

count = 100
res = timeit.timeit(test, number=count)
print(res / count, "seconds per call")

Number defaults to 1000000, so you should always pass it.  This tells you how much time a function takes per call, on average.  For really fast functions you want higher counts; for long functions like directory iteration, really low ones are fine.  If your function isn't async (which, as I said, it shouldn't be) this can be used to time it, but it can also be used to time your trim function alone, which may be very informative here.  In general "I feel like it takes some time on some directories" isn't helpful; "it takes 13 seconds on 500 files, but 100 seconds on 1000 files" is very useful by contrast (it tells you that doubling the number of files more than doubles the time, which wouldn't make sense for this task).

My Blog
Twitter: @ajhicks1992

2021-03-27 20:10:01

BTW, if you feel like something is slowing down your code, don't hesitate to use the profile / cProfile modules to track down how much time is taken by which function or method, tracing the time issue to its core.
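For instance, a quick cProfile sketch (slow_task is just a stand-in for whatever you are investigating):

```python
import cProfile
import io
import pstats

def slow_task():
    # Stand-in for the code under investigation.
    total = 0
    for i in range(100000):
        total += i
    return total

profiler = cProfile.Profile()
profiler.enable()
slow_task()
profiler.disable()

# Show the five most expensive entries by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```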

2021-03-27 20:44:34

Here is the code for my string trim function

import time

def trim(the_string, the_text, keep_direction, occurrence, ignore_adjacents=False):
 try:
  the_string=str(the_string.strip())
 except AttributeError:
  pass
 the_text=str(the_text)
 keep_direction=str(keep_direction)
 before_string=the_string
 after_string=the_string
 #isolating was a beta that I didn't put work into
 isolated_string=the_string
 matches=[]
 index=0
 while index<len(the_string):
  time.sleep(0.005)
  #we check for any matches, with the starting value increasing as we search
  index=the_string.find(the_text,index)
  if index==-1:
   break
  else:
   matches.append(index)
  """if adjacents is set to true, we ignore the number of characters in the search string to prevent overlaps
  You will usually want to set this to true if your string is more than 1 character"""
  if not ignore_adjacents:
   index+=1
  else:
   index+=len(the_text)
 if len(matches)>0:
  if occurrence=="first":
   index=matches[0]
  elif occurrence=="last":
   index=matches[len(matches)-1]
  else:
   index=matches[occurrence-1]
  if keep_direction=="before" or keep_direction=="split":
   before_string= the_string[:index]
  if keep_direction=="after" or keep_direction=="split":
   after_string= the_string[index+len(the_text):]
  #this was supposed to keep a string between 2 occurrences, but not sure if it works lol
  if keep_direction=="isolate":
   isolated_string= the_string[matches[occurrence[0]]:-matches[occurrence[1]]]
   return isolated_string
 if keep_direction=="before":
  return before_string
 elif keep_direction=="after":
  return after_string
 elif keep_direction=="split":
  return before_string, after_string

2021-03-27 20:44:40 (edited by pauliyobo 2021-03-27 20:48:01)

I think it's also worth mentioning that you can utilize the glob method if you are working with Path objects.
Edit:
@4 While this is most likely preference, you can easily retrieve the last element of a list by using -1 as the index. The same logic applies to any index you want to retrieve in reverse: you'd use -2 to get the second-to-last element.
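In other words, the matches[len(matches)-1] in the trim function above can be written as:

```python
matches = [3, 9, 14]
print(matches[-1])   # 14, same as matches[len(matches) - 1]
print(matches[-2])   # 9, the second-to-last element
```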

Paul

2021-03-27 21:23:06

If you're just trimming a string and you don't need any other functionality, Python provides this in the str class: lstrip, rstrip, and strip.
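One caveat worth knowing: these methods take a set of characters to remove, not a substring, which can surprise you if you try to strip an extension with them. A quick illustration:

```python
print("  song.mp3  ".strip())     # "song.mp3": whitespace removed from both ends
print("xxabcxx".lstrip("x"))      # "abcxx": leading x characters removed
print("xxabcxx".rstrip("x"))      # "xxabc": trailing x characters removed
# Pitfall: the argument is a character set, not a suffix string.
print("camp.mp3".rstrip(".mp3"))  # "ca", not "camp"!
```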

"On two occasions I have been asked [by members of Parliament!]: 'Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out ?' I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question."    — Charles Babbage.
My Github

2021-03-27 21:34:17

The bug is that there's a time.sleep in the middle of that string-trimming function, for 50 milliseconds, that executes for every character in the string.  I don't know what you think it's doing, but it's not doing anything except making the function slow (presumably you added it because you think it helps the event loop?).  The directory depth matters because when you include the directory there are more characters per string, and each one costs a 50ms sleep.

But even so, you should either split that function into multiple functions or get rid of it entirely.  That's going to be slow-ish at best, no matter the language, and it's terrible code: you've got doc comments in the middle, "I didn't implement this" notes, and there's no way it's bug-free.  You can do most of what it does either with built-in string methods or regexes, which will be faster but also have the advantage of definitely working.  For example rfind can find from the end of the string instead of the start.  In general, if you find yourself writing string algorithms, you will then find out Python has a built-in function somewhere that can do it for you.
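For example, the whole "find the last dot and keep what's after it" job reduces to:

```python
name = "my.album.track.flac"
dot = name.rfind(".")                      # index of the last ".", or -1
ext = name[dot + 1:] if dot != -1 else ""  # everything after that dot
print(ext)  # flac
```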

My Blog
Twitter: @ajhicks1992

2021-03-27 22:10:02 (edited by Zarvox 2021-03-27 22:13:19)

50 ms? Holy crap, I thought that was 5. And I knew about lstrip and rstrip, but didn't think they would work; I was very new to Python when I created this and I kind of ran with it ever since. I tried looking for some answers on Duck Duck Go, but couldn't find any.
I also don't have any idea what modules like glob are, or what regexes are. I am completely ignorant of the built-in methods and all of the modules. I never claimed to be good at coding or to have good practices, but I'm starting to hit lots of walls now. How do you keep up with what modules are in Python, what each one is for, and what each one can do? Oh, and how in the world do you understand the Python documentation? That thing is worded in vocabulary I don't know and seems to ramble on about things only semi-related.
@5 I forgot about using the minus sign to retrieve elements of a list in reverse order, thanks for that.

2021-03-27 22:17:36

You shouldn't sleep at all, in any way, ever, unless you want the whole program to stop.  That sleep is doing nothing beneficial in any way.  I'm not sure if you got that, since your reaction seems to be "yeah, it's supposed to be there".  If you're trying to yield to the event loop you need to await asyncio.sleep(seconds), but in this case even iterating 20000 files is going to be so fast that the overhead of sleeping that way isn't worth it, and you won't block the loop long enough to matter (but again: asyncio has tools to call it in a thread; if it matters, use those).

In your case, if you don't want to use re, you can do ext = name[-4:], add a "." to all the exts, and match against that instead.  Python's slice syntax doesn't care if you go past the end and will just stop slicing, so "abc"[-4:] is "abc".
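Another built-in worth mentioning here: str.endswith accepts a tuple of suffixes, which avoids both re and fixed-length slicing. A sketch, assuming the formats list from the first post:

```python
formats = ["wav", "flac", "mp3", "ogg", "m4a", "wma"]
suffixes = tuple("." + ext for ext in formats)

def is_supported(name):
    # endswith tries every suffix in the tuple; lower() makes it
    # case-insensitive.
    return name.lower().endswith(suffixes)

print(is_supported("track.FLAC"))  # True
print(is_supported("readme.txt"))  # False
```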

My Blog
Twitter: @ajhicks1992

2021-03-27 22:25:41

You don't need time.sleep in fast loops like that. In the main loop of a game that runs for a long time, yeah, because otherwise you're not giving your CPU any time to breathe, and in Python the magic number is 0.001. Camlorn is wrong though; afaik that is 5 ms.

2021-03-27 22:50:22

Yeah, that's 5 ms, not 50. Either way the time.sleep is completely unnecessary and adds an extra 5 ms to every iteration of your loop.

"On two occasions I have been asked [by members of Parliament!]: 'Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out ?' I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question."    — Charles Babbage.
My Github

2021-03-27 22:59:49

Yeah, you're right.  Got off by a zero somewhere along the way.  But that's still the problem either way.

My Blog
Twitter: @ajhicks1992

2021-03-28 02:37:55

Another suggestion besides the glob module is os.path.splitext; it returns a tuple with the file name as the first item and the extension (including the dot) as the second.
There are more functions in the os.path module worth a look too, for example os.path.basename, which returns only the file name with its path removed.
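A quick sketch of both (note that splitext returns a tuple and keeps the dot on the extension):

```python
import os.path

root, ext = os.path.splitext("D:/music/song.flac")
print(root)  # D:/music/song
print(ext)   # .flac
print(os.path.basename("D:/music/song.flac"))  # song.flac
```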
Hope it helps.

2021-03-28 04:21:43

@8 I opted to use the trim function rather than using that string method, as some extensions are longer than 4 characters, for example aiffc.
@13 oh that is so cool; exactly what I am looking for. Why I never thought to look at the os module in depth, idk. I'm a dumbass.

2021-03-28 09:01:43

The problem with the glob module is that for every extension, it has to rebuild the list and extend it. I don't like that because it wastes a lot of unnecessary time rechecking all of the directories. Would using os.walk and os.path.splitext be slower than using glob multiple times?

2021-03-28 09:22:57 (edited by pauliyobo 2021-03-28 09:23:23)

@15. Sorry, I don't understand why you would want to use glob multiple times.
Glob will also match files recursively, if that is the reason, so you don't need to run it more than once if that's your objective.

Paul

2021-03-28 12:33:14

For multiple extensions. There doesn't seem to be a way to search for more than one pattern at once.

2021-03-28 17:35:52

You run glob once, which gets you the entire directory tree, then you process the list to get rid of the ones you don't want after the fact.  This is likely faster than os.walk (though you should benchmark) because it delegates to C while it builds the list rather than literally grabbing every file one at a time.
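A sketch of that approach, assuming Python 3.5+ for recursive glob; the function name is mine:

```python
import glob
import os

def find_supported(starting_path, formats=("wav", "flac", "mp3")):
    # One recursive glob grabs the whole tree in a single call...
    pattern = os.path.join(starting_path, "**", "*")
    everything = glob.glob(pattern, recursive=True)
    # ...then one pass filters out anything without a wanted extension.
    suffixes = tuple("." + f for f in formats)
    return [p for p in everything if p.lower().endswith(suffixes)]
```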

For a non-fixed-length extension:

split = filename.rsplit(".", 1)
ext = ""
if len(split) == 2:
    ext = split[1]

And to get just the filename without the directory:

directory, filename = os.path.split(path)

Note that os.path is a module that you have to import as os.path, and if there's no directory, directory gets set to the empty string.

You may also want os.path.abspath which converts relative paths to absolute paths; this can add reliability in some edge cases, so I generally always run filenames through it.

Now I'm going to extrapolate one further.  I assume that you're trying to pass these out to something that wants the filename and the path.  So you can:

import os
from collections import namedtuple

FileIterationResult = namedtuple("FileIterationResult", ["filename", "path"])

def iterate_files(paths):
    results = []
    for path in paths:
        # get the files here; os.path.split gives (directory, name)
        directory, name = os.path.split(path)
        results.append(FileIterationResult(filename=name, path=path))
    return results

And then everything outside of it can do file.filename, file.path, and so on.

My Blog
Twitter: @ajhicks1992

2021-03-29 11:04:34 (edited by Zarvox 2021-03-29 11:12:06)

Ah, I see. I need to review the str methods, because I didn't know rsplit could do that; that's almost exactly what my trim function does. How come when I look up these questions online, no one gives answers like this? They always use a custom way of doing it, as if rsplit doesn't exist. Here's the thing: I don't make custom functions because I think I know coding better than the devs of the language, lol, that's ridiculous. I build custom functions because I am unaware of built-in functions that can already do this. And any time I look something up online, they always want to use custom stuff. And any time I put in a search term that had to do with directories, it was always the os module being shown. Glob never showed up in results; I mean, os can do more related to directories, but not even a reference to glob appeared in anything I read. I have to specifically put "glob module" in my search.
I suppose what I should do is look up a very simplified list of all the modules in Python and what they are for, then expand my knowledge from there. I realize now that I know almost nothing about this language and what it can do. Just because I use it doesn't mean I understand it, and if I want to get things done correctly, I need to learn what this language is capable of, or at least what modules I can use in the project I am working on.

2021-03-29 12:20:19

You could also just browse the module index of the standard library and pick what piques your interest. Each module usually has a little description of what it does.

Paul

2021-03-29 14:13:32

Yeah I should look on python.org for documentation about the modules.

2021-03-29 15:44:06

Read all of this page.  All of it:  https://docs.python.org/3/library/stdtypes.html

If you aren't searching with Google, you should be, for what that's worth: you won't beat them for finding programming answers.  But we don't learn this stuff through search.  We learn this stuff through reading the manual, as I said in the other thread.  You can't learn it by always trying to answer the specific thing in front of you; you have to learn before you try to solve problems, not after.  I mean, sure, sometimes you just solve the specific problem with Google for lack of time, but at your level of learning, if you can think "this sounds like a problem lots of apps have to solve", then either it's built into the language or it's on PyPI.

My Blog
Twitter: @ajhicks1992

2021-03-29 15:48:45 (edited by zakc93 2021-03-29 15:51:19)

You can use a list comprehension to get a list of files with the extensions you want. Example:
from pathlib import Path
supported_files = [x for x in Path(starting_path).iterdir() if x.suffix.lstrip(".") in formats]
Note that .suffix keeps the leading dot, so you either strip it (as above) or include dots in your formats list.
A path object has a method .iterdir() which returns all the files and directories in that directory, although it doesn't go through the child directories like os.walk does. There is also a .glob method to return files/directories matching a pattern, such as mp3_files=starting_path.glob("*.mp3")
Path objects already have built in properties like .name for the filename, .stem for the name without the extension, .suffix for the extension, .parent for the path without the last item, etc. See https://docs.python.org/3/library/pathlib.html
Let's say you loop through your list; here are some useful things you can get. Say your first item is a file called some.thing, located at D:\documents\some.thing:
for i in supported_files:
    file_name=i.name #some.thing
    file_extension=i.suffix #.thing
    file_name_without_extension=i.stem #some
    directory_where_file_is_located=i.parent #path object of D:\documents
    list_of_path_components=i.parts #a tuple: ('D:\\', 'documents', 'some.thing')
    i.is_file() #checks if i is a file
    i.is_dir() #checks if i is a directory
    i.exists() #checks if i exists, which would obviously be true in this case
    str(i) #can be used on any path object to give a string representation of the path, "D:\documents\some.thing" in this case
Hope that helps. I'm not sure if os.path has any advantages that I'm not aware of, but I just prefer doing things with pathlib.

2021-03-29 15:59:16

Oh, I forgot to mention: you can also use .rglob to find matches to your pattern in all subdirectories of your path. So if you want to get everything in all subdirectories, instead of .iterdir() you can do .rglob('*')
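Putting that together with the suffix filtering from earlier in the thread, a sketch (the function name is mine; remember .suffix keeps the dot):

```python
from pathlib import Path

def collect_supported(starting_path, formats=("wav", "flac", "mp3")):
    wanted = {"." + f for f in formats}
    # rglob("*") walks every subdirectory; is_file() skips the directories.
    return [p for p in Path(starting_path).rglob("*")
            if p.is_file() and p.suffix.lower() in wanted]
```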