
Caesar_091
macrumors 6502 · Original poster · Jan 18, 2005 · Italy
Hi all,

I'm not a CLI guru and I have to run the same Terminal command on several files. Each command takes up to 5-6 hours to complete on the Mac Pro in my signature, and I can run up to ten simultaneous tasks (I already tested it; with more than 10 tasks at the same time the system starts to suffer from CPU and RAM load).
The problem is that I have to run the same command on hundreds of pairs of files :D

The command, in short, is something like this:
Code:
command -output /path/to/output/dir/fileA -input fileA1 -input fileA2 -option1 -option2 -option3
I have to repeat it for hundreds of pairs of files (fileB1 and fileB2, fileC1 and fileC2, ..., fileXXX1 and fileXXX2). Is a bash script the easiest way to run a fixed number of simultaneous tasks? Should I look at AppleScript instead? I imagine that standardizing the location of the source files and the file names is a good start... but then how should I move on?

Any help (even a good link to basic bash scripting) is much appreciated o_O
 
Please describe how you plan to provide each of the following:
/path/to/output/dir/fileA
fileA1
fileA2

Those items are variables. That is, they vary or change for each invocation or use of the command. Taken together, the variables make up a set of values that are used to perform exactly one task. A different set of variables would constitute a different task.

I assume none of the other parameters, such as -option1, etc. are variables. If they are, then you'll have to provide them with each set of variables used for a single task.

The variables might be provided in an input file you've edited. They might be produced by having a script that scans a folder and extracts names that match a certain pattern, such as "find all the files that start with 'A' and have '.xyzzy' as the suffix". You need to tell us exactly how the variables will be provided.
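For illustration only (the names and paths below are made up, nothing has been decided yet), such a hand-edited input file could hold one variable-set per line: the output folder, then the two input files, separated by whitespace.
Code:
/path/to/output/dir/fileA  fileA1  fileA2
/path/to/output/dir/fileB  fileB1  fileB2
/path/to/output/dir/fileC  fileC1  fileC2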


It appears you want to have around 10 concurrent tasks running at once. This kind of concurrency is probably simplest to do in bash, where ending a command-line with & tells the shell to run the command in the background (i.e. without waiting for it to finish).
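A minimal sketch of that, reusing the placeholder command and paths from the first post:
Code:
#!/bin/bash
# Each trailing "&" sends the command to the background, so the next line
# starts right away instead of waiting hours for the previous one to finish.
command -output /path/to/output/dir/fileA -input fileA1 -input fileA2 -option1 -option2 -option3 &
command -output /path/to/output/dir/fileB -input fileB1 -input fileB2 -option1 -option2 -option3 &

# "wait" pauses here until every background job started above has exited.
wait
echo "Both background tasks are done."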

If it's possible, I recommend segmenting your overall input into 10 distinct series of variable-sets, then starting 10 concurrent bash scripts, one servicing each independent series. Each series of variable-sets is independent of all the others, and there's no contention for any common resources. This independence will make it easier to test and refine things, because you can start with 1 series, then expand to 2, etc. simply by splitting the 1 big series into as many independent sub-series as you want. Otherwise, if you have 10 concurrent tasks all contending for access to a single shared list, you have to deal with the contention and coordination, so no set is done twice, and none is skipped. Independent will be simpler than dependent (the dependency is the shared provider of a variable-set for the next task).
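A rough sketch of that splitting, assuming the variable-sets already sit in a master text file (tasks.txt) and that a separate worker script (run_series.sh) knows how to process one sub-list; both names are invented for this example:
Code:
#!/bin/bash
# Split the master list into 10 roughly equal sub-lists named series_aa, series_ab, ...
total=$(wc -l < tasks.txt)
per_list=$(( (total + 9) / 10 ))      # ceiling of total/10
split -l "$per_list" tasks.txt series_

# Launch one background worker per sub-list, then wait for all of them.
for sublist in series_*; do
    ./run_series.sh "$sublist" &
done
wait
echo "All series are finished."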

If you expect all these tasks to be done again many times in the future, then it would probably be worthwhile to make a shared component that dispenses work and avoids contention issues. This is often called a "dispatcher" or "producer" object, but it can be known by other terms. If that's something you expect to do, it can be done later, after you've got the basic parts working properly.
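As an aside (not something proposed above): the stock xargs command can already act as a very simple dispatcher through its -P flag, which keeps a fixed number of worker processes busy and hands each new one the next line of input. A sketch, reusing the hypothetical tasks.txt list and a hypothetical per-task script run_one.sh:
Code:
# Keep at most 10 copies of run_one.sh running at once; each invocation gets
# the whitespace-separated fields of one line (output dir, input1, input2) as arguments.
xargs -P 10 -L 1 ./run_one.sh < tasks.txt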
 
First of all, thanks for taking the time to answer so exhaustively ;)

Please describe how you plan to provide each of the following:
/path/to/output/dir/fileA
fileA1
fileA2

Those items are variables. That is, they vary or change for each invocation or use of the command. Taken together, the variables make up a set of values that are used to perform exactly one task. A different set of variables would constitute a different task.

You're right: those are the only variables.

My idea is to put each pair of files in the same folder, or to use a script to scan for those files inside a bigger folder with subfolders. The output folder name should be created from the first part of the filename. I'll run a task for fileA1 and fileA2 with the target output to folder /../fileA, then one task for fileB1 and fileB2 using the output folder /../fileB, etc.
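Just to make that pairing concrete, a rough sketch with placeholder names and paths (not the real ones):
Code:
name="fileA1"
pair="${name%1}2"                     # fileA2: the same stem with a trailing 2
outdir="/path/to/output/${name%1}"    # .../fileA, derived from the stem
mkdir -p "$outdir"
command -output "$outdir" -input "$name" -input "$pair" -option1 -option2 -option3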

I assume none of the other parameters, such as -option1, etc. are variables. If they are, then you'll have to provide them with each set of variables used for a single task.

That assumption is correct: those command-line options are all the same for every task.

The variables might be provided in an input file you've edited. They might be produced by having a script that scans a folder and extracts names that match a certain pattern, such as "find all the files that start with 'A' and have '.xyzzy' as the suffix". You need to tell us exactly how the variables will be provided.

This is pretty much what I was looking for. The files all have the same name format:
filename1CODE1CODE2
filename2CODE1CODE2
filename3CODE1CODE2
..
..
filenameXCODE1CODE2

It appears you want to have around 10 concurrent tasks running at once. This kind of concurrency is probably simplest to do in bash, where ending a command-line with & tells the shell to run the command in the background (i.e. without waiting for it to finish).

Ok.

If it's possible, I recommend segmenting your overall input into 10 distinct series of variable-sets, then starting 10 concurrent bash scripts, one servicing each independent series. Each series of variable-sets is independent of all the others, and there's no contention for any common resources. This independence will make it easier to test and refine things, because you can start with 1 series, then expand to 2, etc. simply by splitting the 1 big series into as many independent sub-series as you want. Otherwise, if you have 10 concurrent tasks all contending for access to a single shared list, you have to deal with the contention and coordination, so no set is done twice, and none is skipped. Independent will be simpler than dependent (the dependency is the shared provider of a variable-set for the next task).

If you expect all these tasks to be done again many times in the future, then it would probably be worthwhile to make a shared component that dispenses work and avoids contention issues. This is often called a "dispatcher" or "producer" object, but it can be known by other terms. If that's something you expect to do, it can be done later, after you've got the basic parts working properly.

Things are starting to get harder for me to understand, but I'll (try to) figure it out. I'll probably need to ask some local geek / CLI guru for help :)
 
My idea is to put each pair of files in the same folder, or to use a script to scan for those files inside a bigger folder with subfolders. The output folder name should be created from the first part of the filename. I'll run a task for fileA1 and fileA2 with the target output to folder /../fileA, then one task for fileB1 and fileB2 using the output folder /../fileB, etc.

1. Running a script to produce the input variables in the proper pairing is probably more difficult than reading names from a plain text file. The reason is simple: the first thing you have to do is produce usable input for the work task.

Without well-formed input data, the work task can't be developed, tested, or made to work. But before anyone can even start developing the work task's script or scripts, they first have to develop the script that produces the well-formed input data.

In short, you have a prerequisite script to develop first.
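To make the contrast concrete, here is a minimal sketch of the simpler route: reading pre-paired names from a hand-edited text file. The tasks.txt name and its three-column layout (output folder, first input, second input) are assumptions, not something that has been specified.
Code:
#!/bin/bash
# Each line of tasks.txt holds one variable-set: output_dir input1 input2.
# The echo makes this a dry run; remove it to actually launch the command.
while read -r outdir in1 in2; do
    [ -z "$outdir" ] && continue      # skip blank lines
    mkdir -p "$outdir"
    echo command -output "$outdir" -input "$in1" -input "$in2" -option1 -option2 -option3
done < tasks.txt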


2. One of the Rules of Thumb in programming is, "Be specific". This also applies when asking questions about programming. So far, you've been vague about important aspects of the problem you're trying to solve.

A script to produce properly paired filenames as input variables will need to be specific about how it matches filenames to the pattern for an input filename. That is, some files found during the scan will NOT be acceptable inputs, and the script must ignore them. But the only way anyone can write that script is to know what pattern an acceptable filename matches. These are too vague:
fileA1 and fileA2 with the target output to folder /../fileA
fileB1 and fileB2 using the output folder /../fileB,
filename1CODE1CODE2
filename2CODE1CODE2
filename3CODE1CODE2

In the first two cases, it looks like the pattern to match is the exact word "file" followed by an uppercase letter, then a single digit. If that isn't the pattern, then you need to be specific.
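If that guess at the pattern is correct (and it is only a guess), it could be written as a shell glob or as a regular expression:
Code:
# Shell glob: the literal word "file", one uppercase letter, one digit.
ls file[A-Z][0-9]

# The same idea as a regular expression, used to filter a listing:
ls | grep -E '^file[A-Z][0-9]$'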

The "output folder" isn't a valid pathname. The initial "/" means the root of the filesystem (i.e. the root folder of the startup drive). The ".." refers to the "parent folder", which is meaningless for the root of the filesystem. "fileB" appears to be the concatenation of the base word "file" (presumed to be invariant for all pairs), and the upper-case letter (invariant for a pair, but varies for different pairs).

It's possible you meant to write "../fileB", which would be the parent folder of (presumably) the folder where "fileB1" and "fileB2" reside. That's a guess on my part, because the description is too vague, and you didn't give any specific examples with complete pathnames.

Or maybe you didn't mean for the "/../" to indicate an actual parent folder at all, and instead you meant it as an ellipsis or placeholder string. I can't tell.

In the latter 3 cases, I can't tell what "CODE1CODE2" represent. Does each "CODEn" represent a single Unicode character? Is "CODE" a literal string "CODE" followed by a single digit? Is "CODE1" a multi-character substring, and "CODE2" a different substring? Are "CODE1" and "CODE2" multi-digit numbers?

Without more specific details about exactly what kind of name matching patterns are involved here, I don't see how this could be programmed. A program is a description of specific step-by-step actions to take that accomplish a goal. Being specific is essential.


3. The command that traverses directory hierarchies, matches name patterns, and performs actions is named 'find'. It has its own syntax for specifying what to do, distinct from shell syntax.

I suggest that you look at the man page for the 'find' command, and try some things yourself. If you run into a problem, or have a specific question to ask, feel free to ask it. For example, if you tried a particular 'find' command-line and got unexpected results, then ask. If you do, please follow these Rules of Thumb:
1. Be specific.
2. Post your code. Show the complete exact cmd-line you used; copy and paste it. Details matter.
3. Describe what you expected to happen.
4. Describe what actually happened, and show any actual output.​
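As a concrete starting point, something like the following would list candidate input files; the starting folder is a placeholder and the -name pattern is only a guess at the naming scheme:
Code:
# Search /path/to/source and its subfolders for regular files whose names
# look like "file" plus one uppercase letter plus one digit, and print each match.
find /path/to/source -type f -name 'file[A-Z][0-9]' -print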


Things are starting to get harder for me to understand, but I'll (try to) figure it out. I'll probably need to ask some local geek / CLI guru for help

If you have someone to interact with locally, that's probably going to work out better in the long run. Be prepared to provide details to that person.
 