
Cromulent

macrumors 604
Original poster
Oct 2, 2006
6,816
1,101
The Land of Hope and Glory
I've attached an example data file that the program needs to read (it is actually a comma-separated file, but I had to change the extension). The thing is, it needs to read data like this from multiple streams and process it at rates of up to one record per second. It will be working discreetly to calculate things such as the moving average, and it must also cope with things such as network latency and data being temporarily unavailable without hanging the computer.

I'm not really looking for programming help as such, as I can handle that part; I'm asking more for help with how best to approach the problem. What do I need to look out for? What would the best approach be? Does anyone have any experience with reading in large data sets and processing them, especially when the data is being streamed to the computer from the internet in real time?

I would imagine it would be handy to write a helper application that just receives the data from the net and stores it in a file, so that the main program can process it at its leisure and does not need to worry about the network side of things. I guess it also helps with security, as you can be sure the data you are reading is in the correct format.

All this will be done in C, although I'm starting to think Python may be a better alternative, as writing the file handling part is unnecessarily complex in C.
 

Attachments

  • data.txt
    9.8 KB · Views: 162

toddburch

macrumors 6502a
Dec 4, 2006
748
0
Katy, Texas
Archives of stock data probably won't really be coming in THAT heavy, so I would most likely use a regular expression for data validation. Ruby, Python, Perl, C - all would be fine for this.

One approach I take for this type of thing is to use a scripting language first - for proof of concept and quick implementation. Then, if the performance sucks, go to a faster language.
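A rough sketch of the regular-expression validation idea, written for Python 3. The record layout used here (`HH:MM:SS,price`) is purely an assumption for illustration, since the actual layout of data.txt isn't shown in the thread, and `RECORD_RE` and `parse_record` are made-up names.

```python
import re

# Assumed layout: "HH:MM:SS,price". Adjust the pattern to the real file.
RECORD_RE = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d+(?:\.\d+)?)")


def parse_record(line):
    """Return (seconds_since_midnight, price), or None if the line is malformed."""
    match = RECORD_RE.fullmatch(line.strip())
    if match is None:
        return None
    hours, minutes, seconds = (int(match.group(i)) for i in (1, 2, 3))
    # The regex only checks the shape, so range-check the time fields here.
    if hours > 23 or minutes > 59 or seconds > 59:
        return None
    return hours * 3600 + minutes * 60 + seconds, float(match.group(4))
```

Rejecting a bad line by returning None (rather than raising) lets the caller simply skip or log it and carry on with the next record.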

Todd
 

Cromulent

macrumors 604
Original poster
Oct 2, 2006
6,816
1,101
The Land of Hope and Glory
Archives of stock data probably won't really be coming in THAT heavy, so I would most likely use a regular expression for data validation. Ruby, Python, Perl, C - all would be fine for this.

One approach I take for this type of thing is to use a scripting language first - for proof of concept and quick implementation. Then, if the performance sucks, go to a faster language.

Todd

Depends. Ultimately it won't be archives but streaming and you can get data as quick as once a second.

Good plan with the scripting language though, that should knock quite a bit of time off initial development.
 

lee1210

macrumors 68040
Jan 10, 2005
3,182
3
Dallas, TX
Does once per second mean 1 record per second, or could there be 10MB on the line each second? This makes a big difference in how you poll, whether you need a dedicated I/O thread, etc.

-Lee
 

Cromulent

macrumors 604
Original poster
Oct 2, 2006
6,816
1,101
The Land of Hope and Glory
Does once per second mean 1 record per second, or could there be 10MB on the line each second? This makes a big difference in how you poll, whether you need a dedicated I/O thread, etc.

-Lee

Once per second for one line of data which is probably going to be 1/2KB, if that. The problem comes when you need to do the mathematical calculations on the data and have less than one second to download the data, append it to the end of the file and perform the calculations.

Basically I need it to run as fast as possible as a delay is not going to be very good for the calculations. I'm guessing I'm going to have to have an I/O thread and at least one data processing thread (probably one per algorithm).

All the data will be is the current stock price and the time. But there are at least 3 different things I need to calculate, one of which requires backtesting against a certain number of previous data points in order to make sure it is correct.
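For the moving average mentioned earlier, something along these lines might do as a starting point (Python 3). It is only a sketch of a simple moving average over a fixed number of prices; the class name `MovingAverage` and the choice of a fixed-size window are assumptions, not something settled in the thread.

```python
from collections import deque


class MovingAverage:
    """Simple moving average over the last `window` prices."""

    def __init__(self, window):
        if window < 1:
            raise ValueError("window must be at least 1")
        self.prices = deque(maxlen=window)
        self.total = 0.0

    def update(self, price):
        """Add one price and return the average of the prices currently held."""
        if len(self.prices) == self.prices.maxlen:
            self.total -= self.prices[0]  # this value is about to be evicted
        self.prices.append(price)
        self.total += price
        return self.total / len(self.prices)
```

Keeping a running total means each new price costs a constant amount of work regardless of the window size. Over a very long run the floating-point total can drift slightly, so recomputing it from the deque now and then may be worthwhile.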
 

lee1210

macrumors 68040
Jan 10, 2005
3,182
3
Dallas, TX
Does the result of the calculation for the current line of input have to be complete before you read the next? Does it need to be complete before you start the calculation on the next line of input?

Depending on the requirements it might be easiest to have a separate thread (or even process) grabbing data from the line and putting it in a queue (could be a hand-crafted queue, a database table, etc.). Your processing thread(s)/process(es) would need to be able to grab the "next" record if the order of the calculations is important, or any available item if not.
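As a minimal sketch of the queue idea in Python 3, using the standard `queue` and `threading` modules: one thread reads, one thread processes in arrival order. The names `reader`, `processor` and `run` are made up, the `source` iterable stands in for the network feed, and doubling each record is just a placeholder for the real calculation.

```python
import queue
import threading


def reader(source, records):
    """I/O thread: push every record onto the queue, then a sentinel."""
    for record in source:           # stand-in for a blocking network read
        records.put(record)
    records.put(None)               # None signals "no more data"


def processor(records, results):
    """Processing thread: handle records strictly in arrival order."""
    while True:
        record = records.get()      # blocks until a record is available
        if record is None:
            break
        results.append(record * 2)  # placeholder for the real calculation


def run(source):
    """Start both threads, wait for them to finish, return the results."""
    records = queue.Queue()
    results = []
    threads = [
        threading.Thread(target=reader, args=(source, records)),
        threading.Thread(target=processor, args=(records, results)),
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

With a single processing thread the FIFO queue preserves order for free. If order doesn't matter, several processing threads could pull from the same queue, though each would then need its own sentinel.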

-Lee
 

Cromulent

macrumors 604
Original poster
Oct 2, 2006
6,816
1,101
The Land of Hope and Glory
Does the result of the calculation for the current line of input have to be complete before you read the next? Does it need to be complete before you start the calculation on the next line of input?

Yep. It is a real time application that will eventually be used to spot simple trends in the numbers as they are received. Therefore it is imperative that all calculations are complete before the next set of data is received.

The point is that eventually it won't be limited to one stream and the data can be received at different intervals for each stream (one could be once a second, one could be once every 15 mins, one could be once a day).
 

Cromulent

macrumors 604
Original poster
Oct 2, 2006
6,816
1,101
The Land of Hope and Glory
Okay following advice I decided to mock up parts of this using Python (first time using it) and am having a problem calling a Python module from my C code.

Here is the C code:

Code:
PyObject *module, *dict, *func, *value;
Py_Initialize();
module = PyImport_ImportModule("logic");
dict = PyModule_GetDict(module);
func = PyDict_GetItemString(dict, "logic");
value = PyObject_CallFunction(func, "HW", "");
	
Py_DECREF(module);
Py_DECREF(dict);
Py_DECREF(func);
Py_DECREF(value);
	
Py_Finalize();

Here is the Python code it is meant to call (just something simple to make sure it is working):

Code:
def HW():
   print 'Hello World!'

It seems to be crashing out at the PyModule_GetDict() function with EXC_BAD_ACCESS, but I have no idea why. As far as I can see they are legitimate places to write and read from. The documentation is poor on this unfortunately.

The code compiles fine with no warnings or errors.
 

Cromulent

macrumors 604
Original poster
Oct 2, 2006
6,816
1,101
The Land of Hope and Glory
Bummer, if you had chosen Ruby I'd dig in and help!

Doh, unfortunately my Dad already has a few Python books so it makes sense to use that.

Although this is turning into a bigger problem than I realised. Hmm, stupid documentation. At least with C you are guaranteed to have decent documentation for the standard library and much used functions. I would have thought embedding a Python script in C would be an extremely well used part of the language.
 

Mac Player

macrumors regular
Jan 19, 2006
225
0
Why not Java? It's easier than C and faster than Python.

Edit: Does the 1 sec limit include the network delay?
 

Cromulent

macrumors 604
Original poster
Oct 2, 2006
6,816
1,101
The Land of Hope and Glory
Why not Java? It's easier than C and faster than Python.

Not sure about that. Python can be extremely fast. Anyway I'm using Python because you can extend a standard C program with it which gives great flexibility. You do all the mission critical stuff in C and all the setup in Python. Plus you can add new features easily with Python.

Java on the other hand is not a particularly easy language to learn, it has a huge sprawling collection of libraries and like C++ it has tried to do the one language fits all approach and is thus pretty unwieldy.
 

Cromulent

macrumors 604
Original poster
Oct 2, 2006
6,816
1,101
The Land of Hope and Glory
I thought the green thread mode was removed.

My point was that Java runs on top of a virtual machine. Anything Java does is native for the VM, but it then gets JIT compiled down to the machine level as it is run. Whether you count that as native is up for debate.

Python on the other hand has a tool which will take normal Python code and turn it directly into native x86 assembly which can then be assembled and run as a normal program. So in that case Python is actually more native than Java as it is properly compiled and not interpreted at all.

Therefore I would argue Python can be faster than Java.
 

Mac Player

macrumors regular
Jan 19, 2006
225
0
Java threads can execute concurrently, python threads can't. And the JVM runs faster even than psyco.
 

Cromulent

macrumors 604
Original poster
Oct 2, 2006
6,816
1,101
The Land of Hope and Glory
Java threads can execute concurrently, python threads can't. And the JVM runs faster even than psyco.

Having done a little Googling I see you are correct.

Still the point is moot because of these reasons:

  • C is faster than both Python and Java and I will be using that for the performance critical areas of the program.
  • Python can be called from within C threads which are native and can run concurrently.
  • Python is (IMO) a nicer language and easier to learn.
  • Python integrates with C in such a way as to make the program extensible in an easy and approachable manner.
 

lee1210

macrumors 68040
Jan 10, 2005
3,182
3
Dallas, TX
This is strictly academic, as I do not support using Java for this task, but Java can be called from C and vice versa by way of the Java Native Interface (JNI). I have not tried the equivalent with Python but I hope it is easier than JNI.

Also, the input filter you are building will be the input process/thread. For this problem I don't think you need multiple input threads, so Python's threading model is irrelevant here. Thanks to those that brought it up, it is important to know in general. I just don't think it will come into play for this project.

I mentioned it briefly before, but it may be worth considering using separate processes for I/O and data processing. These tasks don't need to interact aside from in the datastore. You can maintain coherency there via file locks or transactions in an RDBMS. With pthreads you need to learn a lot more (never a bad thing) before you can start working. You may need to use semaphores depending on your design. If you don't have a backing store, whatever you've read is gone if your program is terminated. If you need a backing store anyway, using this with coprocesses is probably easier.
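To make the shared-datastore idea a bit more concrete, here is a small Python 3 sketch. It uses SQLite (via the standard `sqlite3` module) purely as a stand-in for whatever RDBMS ends up being used; the table layout and the function names `open_store`, `store_price` and `prices_between` are assumptions for illustration only.

```python
import sqlite3


def open_store(path):
    """Open (and if needed create) the shared price table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS prices "
        "(ts INTEGER PRIMARY KEY, price REAL NOT NULL)"
    )
    conn.commit()
    return conn


def store_price(conn, ts, price):
    """Acquisition side: insert one record inside a transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("INSERT INTO prices (ts, price) VALUES (?, ?)", (ts, price))


def prices_between(conn, start_ts, end_ts):
    """Processing side: fetch records in a time range, oldest first."""
    cursor = conn.execute(
        "SELECT ts, price FROM prices WHERE ts BETWEEN ? AND ? ORDER BY ts",
        (start_ts, end_ts),
    )
    return cursor.fetchall()
```

The acquisition process and the processing process would each call `open_store` on the same file path and hold their own connection; the database's transactions then take care of coherency between them, and nothing already stored is lost if either process dies.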

Anyway, good luck, keep posting with progress. I might pull out my machine in a few minutes and give the C/Python bridge a try. I'll let you know if I come up with anything.

-Lee
 

Cromulent

macrumors 604
Original poster
Oct 2, 2006
6,816
1,101
The Land of Hope and Glory
Good advice, I think a database may work well as I will have to query ranges of data. I assume something like PostgreSQL will be suitable. Although I think it might be overly complex at this moment in time.

Working with a common data store could be beneficial in other ways too. I'll have to think about that a little bit.

As for Python the C API for it seems very comprehensive and rather nice for a bridge.
 

lee1210

macrumors 68040
Jan 10, 2005
3,182
3
Dallas, TX
I've never used Python on my system before, so there were some issues getting it built and getting my module included, but I have something that seems to be working (I don't normally consider this kind of test complete until I can pass a dynamic chunk of data, like a variable-length list of strings, but I'm pretty tired).

Here's what I have:
testpy.c:
Code:
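#include "Python.h" //I had to use a long path, but I assume your system is better setup
#include <stdio.h>

int main(int argc, char *argv[]) {
  PyObject *module, *dict, *func, *value;
  Py_Initialize(); //Set up the interpreter
  PyRun_SimpleString("import sys \n"); //This line and the next
  PyRun_SimpleString("sys.path.append('.')\n"); //are only b/c of my system

  module = PyImport_ImportModule("logic"); //Import the module in logic.py
  if (module == NULL) { //Import failed, e.g. logic.py is not on sys.path
    PyErr_Print();
    Py_Finalize();
    return 1;
  }
  dict = PyModule_GetDict(module); //The module's namespace (borrowed reference)
  func = PyDict_GetItemString(dict, "HW"); //Get a reference to the function (borrowed)
  if (func == NULL || !PyCallable_Check(func)) {
    fprintf(stderr, "HW not found in logic.py\n");
    Py_DECREF(module);
    Py_Finalize();
    return 1;
  }
  value = PyObject_CallFunction(func, NULL); //Call HW with no parameters
  if (value == NULL) PyErr_Print(); //The call raised an exception

  Py_XDECREF(value); //Only module and value are owned references;
  Py_DECREF(module); //dict and func are borrowed and must not be DECREF'd

  Py_Finalize();
  return 0;
}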

logic.py:
Code:
def HW():
  print 'Hello from Python!'

When I built with GCC I had to link against the Python library explicitly. Again, hopefully your system has this set up already, but just an FYI.

-Lee
 

Cromulent

macrumors 604
Original poster
Oct 2, 2006
6,816
1,101
The Land of Hope and Glory
Crashes out when I try and run it in Xcode. I have the Python included with Xcode tools (2.5.1) and have included libpython.dylib into the project as well as the Python header file.

I'll have to do some more reading when I'm not so tired. I don't think I'm thinking straight enough to do any programming at the moment.
 

mamcx

macrumors regular
Mar 13, 2008
210
28
I think the best route is to do this directly in Python. You can make it run faster with psyco and can solve the threading with Stackless (but anyway, nobody can solve that except Erlang!).

Also, I think you can capture the data stream as fast as you can, but delay the calculation.

Imagine it is a waterfall. Put a lake there to hold the water, then clean it later. If the result is only going to be shown in a GUI, the delay is minimal.
 

lee1210

macrumors 68040
Jan 10, 2005
3,182
3
Dallas, TX
Crashes out when I try and run it in Xcode. I have the Python included with Xcode tools (2.5.1) and have included libpython.dylib into the project as well as the Python header file.

I'll have to do some more reading when I'm not so tired. I don't think I'm thinking straight enough to do any programming at the moment.

GDB it and see which of the PyObject pointers you have there come back null. The crash is likely passing one of those as null to another Python function, or when you do Py_DECREF on them. Once you know that it may be easier to figure out. I had the logic.py file in the same working directory as my program (I just used gcc from terminal, I wasn't trying Xcode), which is why I did:
Code:
sys.path.append('.')

I don't know what directory Xcode uses for its working directory, but it might be better for you, while you're trying this out, to put the logic.py file in a specific place, then just add that to Python's path using the sys.path.append function.
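Seen from the Python side, the "put logic.py in a specific place and add that to the path" idea looks roughly like the sketch below (Python 3). The temporary directory is only a stand-in for whatever fixed location you pick, `import_from_dir` is a made-up helper name, and this demo `HW` returns its string instead of printing it so the result is easy to check.

```python
import importlib
import os
import sys
import tempfile


def import_from_dir(directory, module_name):
    """Add `directory` to Python's module search path, then import the module."""
    if directory not in sys.path:
        sys.path.append(directory)
    importlib.invalidate_caches()  # make sure freshly written files are seen
    return importlib.import_module(module_name)


# Demo: write a throwaway logic.py into a temporary directory and import it.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "logic.py"), "w") as handle:
    handle.write("def HW():\n    return 'Hello from Python!'\n")

logic = import_from_dir(demo_dir, "logic")
print(logic.HW())
```

Using an absolute directory like this means the import no longer depends on whatever working directory the program happens to be launched from.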

That was one of the causes of problems when I started out, the other was that I had to change the last argument of PyObject_CallFunction to NULL from the empty string. The rest of the issues I had seemed to be related to the environment, so if Xcode is already handling that you shouldn't have to worry about those.

-Lee

P.S. If you are going to go with a co-process method, you could just run a Python script by itself for the "data acquisition" process as mamcx suggested. However, I think these bridges are interesting so I like playing with them. Even after I went through the trouble of passing dynamic data to/from Fortran to C to Java (via JNI) and Fortran to C to C# (via Embedded Mono) I haven't ended up using either in production. In my case the overhead of starting a lot of JVMs for JNI or embedding the mono runtime in all of our processes wasn't worthwhile.
 

Cromulent

macrumors 604
Original poster
Oct 2, 2006
6,816
1,101
The Land of Hope and Glory
Bah, the program works perfectly when compiling on the command line :(. Yet another reason to hate IDEs which just get in your way...

I guess I'm going to have to have a closer look at Xcode.
 