macOS C++ character conversions EBCDIC -> ASCII

toddburch · May 13, 2007

I've written a C++ app to translate EBCDIC to ASCII. Being fairly new to C++, I spent quite a bit of time on it, but it's now working.

As part of the process, I wrote a dumping routine to output the incoming data into a hex-dump format display, like this:

Code:

00000000  C3F4F4F0 F1F3F0F3 F1F9F525 480CF0F0  F7F0F840 40C240F4 40000000 00914C00  *...........%H......@@.@.@.....L.*
00000020  00001CD5 D6D560C3 D6E5C5D9 C5C440C3  C8C1D9C7 C5E24000 0C00000C 00000C00  *......`.......@.......@.........*

Above, the first field is the hex offset into the record, then the record itself, then the eyecatcher area on the right, wrapped in "*"s.
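For reference, the layout just described can be sketched as a single row formatter. This is only an illustration and not the routine from the actual program: `dump_row` is a hypothetical name, the 32-bytes-per-line width is taken from the sample, and the eyecatcher rule (print the raw byte if it is printable ASCII, otherwise a dot) is inferred from the sample output.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: format up to 32 bytes as one dump line in the
// "offset  hex words  *eyecatcher*" layout shown in the sample.
std::string dump_row(const unsigned char* data, std::size_t offset, std::size_t count) {
    static const char hex[] = "0123456789ABCDEF";
    const std::size_t kRow = 32;              // bytes per line, as in the sample
    std::string line;

    // 8-digit hex offset, most significant nibble first.
    for (int shift = 28; shift >= 0; shift -= 4)
        line += hex[(offset >> shift) & 0xF];

    // Hex area: 4-byte words, one space between words,
    // two spaces in front of each group of four words.
    for (std::size_t i = 0; i < kRow; ++i) {
        if (i % 16 == 0)     line += "  ";
        else if (i % 4 == 0) line += ' ';
        if (i < count) {
            line += hex[data[i] >> 4];
            line += hex[data[i] & 0xF];
        } else {
            line += "  ";                     // pad a short final row
        }
    }

    // Eyecatcher: raw byte if printable ASCII, otherwise a dot.
    line += "  *";
    for (std::size_t i = 0; i < kRow; ++i) {
        unsigned char b = (i < count) ? data[i] : ' ';
        line += (b >= 0x20 && b <= 0x7E) ? static_cast<char>(b) : '.';
    }
    line += '*';
    return line;
}
```

Calling `dump_row(bytes, 0x20, 32)` with the 32 bytes of the second sample line should give back that same line.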

Unpacking the data was the toughest part. For instance, converting an incoming character like 0xF2 and turning it into a string "F2". I'll show you how I did it (a cut-down version for simplicity), and would like to get some feedback on how I might have done it either more efficiently, or "better"** by leveraging (leaning) more on C++. The comments in the code and the code itself tell the story of the issue I was having when picking up a value of X'80' or larger. I would be most interested in simplifying this conversion process.

Thanks for your feedback. I've a thick skin, so let me have it!

Code:

#include <iostream>
using namespace std ; 

#define SAMPLE "1 Ring To Rule Them All"  // Sample string to convert 
#define HEXCHARS "0123456789ABCDEF"       // All valid hex chars 
 
void to_hex(char *indata, int i) ;  // Function prototype.  -> data, character to convert. 

int main (int argc, char * const argv[]) {
	int i ; 

	char outdata[((sizeof SAMPLE-1) * 2)+1] ;  // Double length + 1 for null terminator  
	outdata[sizeof outdata-1] = '\0' ;   // Null terminate the string (a char literal; NULL is meant for pointers)
	
    for (i = 0 ; i < sizeof SAMPLE-1 ; i++ ) {  // Run the entire input string. 
		int c = SAMPLE[i] ;                   // Pick up a character 
		// Picking up 0X80 or above causes a negative value.  
		// So, add 256 to it to make it positive. 
		if (c < 0) c += 256 ;       // make positive if picking up the byte caused the sign bit to propagate.
		to_hex( &outdata[i*2], c) ; // Convert the byte just picked up 
	} 
	cout << '"' << SAMPLE << '"' << " converted is " << outdata << endl ;  // Normal Text
	
	// Now, show the scenario for a high value. 
	char temp[3] ; 
	temp[2] = '\0' ;
	 
	char z = 0xFF ; 
	int j = (int) z ;  
	
	to_hex( temp , 0xFF ) ;  
	cout << "The HIGHVALUE is " << 0xFF << ".  As an integer it is " << j << ". Converted it is " << temp << endl ; 	
	return 0;
}

// Function to convert a byte to a hex displayable value. 
void to_hex(char *indata, int byte) { 
	indata[0] = HEXCHARS[ byte / 16] ;  // get left nibble 
	indata[1] = HEXCHARS[ byte % 16] ;  // right nibble 
	return ; 
}

I was reading up on the C++ I/O model with its Formatted I/O and the Manipulators for converting the output stream to hex, but these are only for integers, and it seems like it would be more work to do that than what I've got so far.
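For comparison, here is a rough sketch of what the manipulator route could look like. The stream only prints hex for integer types, so each `char` has to be widened first, going through `unsigned char` so that values of 0x80 and above do not sign-extend. `hex_via_stream` is a made-up name for illustration.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: hex-encode a string using the stream manipulators.
std::string hex_via_stream(const std::string& text) {
    std::ostringstream out;
    // hex, uppercase and the fill character stay in effect once set.
    out << std::hex << std::uppercase << std::setfill('0');
    for (std::string::size_type i = 0; i < text.size(); ++i) {
        // unsigned char first (no sign extension), then widen to an integer
        // so the stream prints a number rather than a character.
        unsigned int value = static_cast<unsigned char>(text[i]);
        out << std::setw(2) << value;   // setw applies to the next item only
    }
    return out.str();
}
```

Whether this is less work than a lookup table is debatable; it mainly shows that the casts are the only extra step.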

Thanks, Todd

** better = me writing less code

rand0m3r · May 13, 2007

just out of interest, why did you write such a program? there are already C++ libraries (glibmm) that perform character conversions.

toddburch · May 13, 2007

Like I said - I'm new to C++. I'm not even familiar with the STL, much less any other add-on libraries.

While I am confident character translators are available, I'm not so confident binary data type converters are so readily available. For instance, on z/OS, (an IBM mainframe operating system, that sits on IBM's z/Series hardware), there are special data types stored in several different binary formats, just as there are under OS X and Windows. Hexadecimal Floating Point (HFP on the mainframe) would be one example. Packed Decimal is another. When transferring a binary file that contains these data types from one platform to another, that data has to be converted, following the rules for transformation defined on the source platform, in order to be used on the new platform. If everything was TEXT based, there would be no issue, or need for specialized type conversions.

Does that answer your question?
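To make the packed decimal case concrete, here is a minimal sketch of decoding one field. It assumes the usual layout (two decimal digits per byte, with the low nibble of the last byte holding the sign: 0xD or 0xB for negative, 0xA, 0xC, 0xE or 0xF for positive) and ignores any implied decimal point. `unpack_decimal` is a hypothetical name, not something from the actual program.

```cpp
#include <cstddef>
#include <stdexcept>

// Hypothetical helper: decode a packed-decimal field into a signed integer.
// Assumes two digits per byte and a sign nibble at the end; no scaling.
long long unpack_decimal(const unsigned char* field, std::size_t size) {
    long long value = 0;
    for (std::size_t i = 0; i < size; ++i) {
        unsigned int hi = field[i] >> 4;
        unsigned int lo = field[i] & 0x0F;
        bool last = (i + 1 == size);

        if (hi > 9 || (!last && lo > 9))
            throw std::invalid_argument("digit nibble out of range");
        value = value * 10 + hi;

        if (!last) {
            value = value * 10 + lo;          // ordinary digit
        } else {
            if (lo < 0xA)
                throw std::invalid_argument("sign nibble out of range");
            if (lo == 0xD || lo == 0xB)       // negative sign codes
                value = -value;
        }
    }
    return value;
}
```

Under these assumptions the three bytes 00 00 1C decode to 1 and the two bytes 12 3D decode to -123.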

lazydog · May 13, 2007

Hi
I think I would modify your code this way:-

Code:

for (i = 0 ; i < sizeof SAMPLE-1 ; ++i )
{
  unsigned char c = SAMPLE[ i ] ;
   *outdata++ = HEXCHARS[ c >> 4 ] ;
   *outdata++ = HEXCHARS[ c & 0xf ] ;
}

To avoid the c < 0 test I've used unsigned char.

I've replaced byte % 16 with byte & 0xf and byte / 16 with byte >> 4. I'm pretty sure this is more efficient than using / and %.
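A small self-contained sketch of the same idea, in case it helps to try it in isolation. The helper names are made up for illustration. The second function simply checks that, for an unsigned byte, the shift/mask form and the divide/modulo form pick the same nibbles.

```cpp
// Hypothetical helper: write the two hex digits for one byte into out[0..1].
// Taking unsigned char means the value is always 0..255, so no sign test is needed.
void byte_to_hex(unsigned char c, char* out) {
    static const char hexchars[] = "0123456789ABCDEF";
    out[0] = hexchars[c >> 4];    // left nibble, same value as c / 16
    out[1] = hexchars[c & 0xf];   // right nibble, same value as c % 16
}

// Returns true if shift/mask and divide/modulo agree for every byte value.
bool nibble_forms_agree() {
    for (int v = 0; v < 256; ++v) {
        unsigned char c = static_cast<unsigned char>(v);
        if ((c >> 4) != (c / 16) || (c & 0xf) != (c % 16))
            return false;
    }
    return true;
}
```

A plain `char` holding 0xF2 can be passed as `static_cast<unsigned char>(ch)`, which gives 'F' and '2' without the add-256 step.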

In your for loop you have i++. It's probably a good idea to get in the habit of using ++i here. Not that it makes much difference for the code you posted but in general there is a subtle difference between ++i and i++ (i++ results in a temporary).

I got rid of to_hex(). It's such a small function anyway. If efficiency is your concern then the overhead of the function call is approaching that of the function itself.

Hope this helps

b e n

toddburch · May 14, 2007

unsigned char! That is EXACTLY what I should have been using. And yes, the SHIFT and AND will be much more efficient than the integer and modulus division.

So, for me, this is the first time to use all three of these features in C++. (Well, I think it's only my 3rd or 4th C++ program to write too...) I use these types of techniques all the time in assembler on the mainframe, but just didn't think to use them in this "high level language"!

Good job!

Last thing. You are using *outdata++, while I pass the address of an element in my call-by-reference function call, and then use array index notation for the assignments. I haven't used pointer arithmetic yet in C/C++. Will outdata need to be reset following this loop to point back to the start of the array? How does that work?

Todd

gnasher729 · May 14, 2007

toddburch said:
unsigned char! That is EXACTLY what I should have been using. And yes, the SHIFT and AND will be much more efficient than the integer and modulus division.

Actually, any decent compiler will know how to perform a division and a modulo operation in the quickest possible way. In Xcode, use "Show Assembly Code" from the Build menu and have a look. Then have a look at how it performs a division by, say, ten.

unsigned int divide_by_ten (unsigned int x) { return x / 10; }

The assembler code will be a bit surprising. The best rule is to write code in the most readable way. Another good rule for fast code is write code in the same way as everyone else does, because then there is a good chance that the compiler knows the patterns that you use and compiles them in the best possible way.
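To give an idea of what the surprise usually is: optimizers commonly replace a division by a constant with a multiplication and a shift. The sketch below writes that out by hand for 32-bit unsigned values. The exact instructions depend on the compiler and CPU, so treat this as an illustration of the technique rather than what any particular build emits. `divide_by_ten_multiply` is a made-up name.

```cpp
#include <cstdint>

// What the source asks for.
std::uint32_t divide_by_ten(std::uint32_t x) { return x / 10; }

// The multiply-and-shift form: 0xCCCCCCCD is ceil(2^35 / 10), so
// (x * 0xCCCCCCCD) >> 35 equals x / 10 for every 32-bit unsigned x.
// The product needs 64 bits, hence the widening cast.
std::uint32_t divide_by_ten_multiply(std::uint32_t x) {
    std::uint64_t product = static_cast<std::uint64_t>(x) * 0xCCCCCCCDu;
    return static_cast<std::uint32_t>(product >> 35);
}
```

Nobody should write the second form by hand; the point is that the compiler already does, so `x / 10` stays the readable choice.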

Only go to less readable code if (1) performance is critical and (2) you have _measured_ the performance and know what parts of a program are actually slow. Most of the time programs are not slow because code isn't optimised, but because someone does something stupid. Obviously if you have done something stupid, it's not in the parts that you think would be slow, but somewhere else. Profiling, for example with Shark on the Macintosh, will find it.

pilotError · May 14, 2007

Not to go too off topic, but if you're doing a bunch of this type of stuff, we use something called fileport which deals with all those binary flavors pretty well.

One thing to note, you may want to keep your translation tables external in a flat file or XML file, so you can change dictionaries. In some instances, you may want to change, say, a curly brace to a blank; in other cases, you may want a different character (going from ASCII to EBCDIC).
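A rough sketch of the external-table idea, using a plain text format invented for this example: one override per line, written as two hex byte values, "from to". The function names and the file format are assumptions, not anything fileport or the original program uses. The caller supplies a 256-entry table already filled with the base code page mapping.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical format: one override per line, "<from> <to>" as hex byte
// values, e.g. "C0 20". Lines that do not parse are skipped.
// The table must already hold 256 entries. Returns the overrides applied.
int load_overrides(std::istream& in, std::vector<unsigned char>& table) {
    int applied = 0;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        unsigned int from = 0, to = 0;
        if (fields >> std::hex >> from >> to && from < 256 && to < 256) {
            table[from] = static_cast<unsigned char>(to);
            ++applied;
        }
    }
    return applied;
}

// Translate a buffer in place through the table.
void translate(unsigned char* data, std::size_t size,
               const std::vector<unsigned char>& table) {
    for (std::size_t i = 0; i < size; ++i)
        data[i] = table[data[i]];
}
```

With EBCDIC code page 037, where the curly braces sit at 0xC0 and 0xD0, the lines "C0 20" and "D0 20" would turn both braces into ASCII blanks.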

Oh the fun of Mainframe interaction! LOL

http://www.syncsort.com/products/ss/fp/home.htm

As far as your code, looks fine... The hex dump stuff is very useful, we build pretty much the same things every time we put a new format online.

lazydog · May 14, 2007

toddburch said:
Last thing. You are using *outdata++, while I pass the address of an element in my call-by-reference function call, and then use array index notation for the assignments. I haven't used pointer arithmetic yet in C/C++. Will outdata need to be reset following this loop to point back to the start of the array? How does that work?
Todd

Yes you'll need to reset the pointer every time you convert your sample to hex. Something like this:-

Code:

static char outdata_bfr[ ( sizeof SAMPLE - 1 ) * 2 + 1 ] ;  // two hex digits per byte, plus terminator

char* outdata = outdata_bfr ;
for (i = 0 ; i < sizeof SAMPLE-1 ; ++i )
{
  unsigned char c = SAMPLE[ i ] ;
   *outdata++ = HEXCHARS[ c >> 4 ] ;
   *outdata++ = HEXCHARS[ c & 0xf ] ;
}

*outdata = '\0' ;

If you want to use a function call to do the conversion then something like this perhaps:-

Code:

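void to_hex( char* & indata, unsigned char byte ) ;  // declared before first use

static char outdata_bfr[ ( sizeof SAMPLE - 1 ) * 2 + 1 ] ;  // two hex digits per byte, plus terminator

char* outdata = outdata_bfr ;
for (i = 0 ; i < sizeof SAMPLE-1 ; ++i )
    to_hex( outdata, SAMPLE[ i ] ) ;  // outdata is advanced inside to_hex()

*outdata = '\0' ;


// Takes the pointer by reference so the caller's pointer moves forward too.
void to_hex( char* & indata, unsigned char byte )
{
   *indata++ = HEXCHARS[ byte >> 4 ] ;
   *indata++ = HEXCHARS[ byte & 0xf ] ;
}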

The advantage of using pointer stuff here is that outdata gets incremented through the loop so you don't need to recalculate the index each time round the loop like you had in your original loop, i.e. to_hex( &outdata[i*2], c).

As an aside, I still think having a function call to convert 1 byte into hex is over the top. A function to convert a data buffer into hex is much more useful, e.g.

Code:

void to_hex( unsigned char* data, int size, char* hex_bfr ) ;

or even

Code:

char* to_hex( unsigned char* data, int size ) ; // Creates and returns  the hex buffer
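One possible shape for the buffer-level version, as a sketch only. It follows the second prototype but returns a std::string instead of a raw char*, which is a substitution made here so that nobody has to remember to free the buffer. The `*out++` walk is the same idiom as in the loops above, applied to a string iterator.

```cpp
#include <cstddef>
#include <string>

// Sketch: convert a whole buffer to hex text in one call.
// Returns std::string rather than char* (a choice made for this example).
std::string to_hex(const unsigned char* data, std::size_t size) {
    static const char hexchars[] = "0123456789ABCDEF";
    std::string hex(size * 2, '0');          // two output characters per byte
    std::string::iterator out = hex.begin(); // walks forward like the char* did
    for (std::size_t i = 0; i < size; ++i) {
        *out++ = hexchars[data[i] >> 4];
        *out++ = hexchars[data[i] & 0xf];
    }
    return hex;
}
```

Because the iterator is a local variable, there is nothing to reset between calls.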

hope this helps!

b e n

toddburch · May 14, 2007

pilotError said:
Not to go too off topic, but if you're doing a bunch of this type of stuff, we use something called fileport which deals with all those binary flavors pretty well.

Dang. A day late and a dollar short.

pilotError said:
One thing to note, you may want to keep your translation tables external in a flat file or XML file, so you can change dictionaries. In some instances, you may want to change, say, a curly brace to a blank; in other cases, you may want a different character (going from ASCII to EBCDIC).

Yes, I had considered that, and will do something in that regard, be it a binary loadable table, or an XML file, or even a text-based file that could be parsed at runtime. I'm not sure how much flexibility is needed yet. I'm still in the early stages of writing. Primary objective - get it working. Then, add the fluff.

pilotError said:
As far as your code, looks fine... The hex dump stuff is very useful, we build pretty much the same things every time we put a new format online.

Thanks for the code review. The full-blown program is a bit more intense than this scaled-down sample.

What do "y'all" do? PM if you want.

Todd

SilentPanda · May 14, 2007

Oh EBCDIC how I loathe thee...

I had to make a file on the PC but they stored some of the data in packed decimal... they were storing things like the day in 4 bytes (1 for month, 1 for day, 1 for century, and 1 for year)... why I have no clue aside from the fact that it was probably old code... I was being lazy and I only had to run the code the one time so I just made a conversion table in code to convert (for instance) 15 hex in EBCDIC to whatever the ASCII equivalent was and so on and so forth. At the time I didn't know enough about much of anything so I just found a conversion table online and coded to read that in... ah well.

Glad to see you got something working though. Even if there are libraries out there that do it, there's nothing wrong with coming up with your own way now and again. Sometimes you can come up with a better way or something that fits your particular situation. If all the code in the world was already written we'd all be out of jobs!

toddburch · May 14, 2007

lazydog said:
Yes you'll need to reset the pointer every time you convert your sample to hex. Something like this:...

Gotchya. You're using an extra pointer besides the array reference itself.

If you want to use a function call to do the conversion then something like this perhaps:...

Ok, just like I did Java a couple months ago, now I get to go research keywords like static. Fun, fun!

Ok, I've been using:

char *c ;

and you are using:

char* c ;

What's the difference?

I'll be getting rid of the to_hex() function. My hexprt() function is already doing all the dump formatting, so I can suck to_hex() up into that function. When I do that, it will be just as you describe in your void to_hex() function, but with only two parms (pointer, length), as it writes the dump directly to a particular file.

Yes, this has all helped quite a bit. Thanks again for taking the time.

Todd

SilentPanda · May 14, 2007

toddburch said:
Ok, I've been using:

char *c ;

and you are using:

char* c ;

What's the difference?

I don't believe there is one. It's just preference and habit. Some people do String[] x and others String x[].

I usually use char* c instead of char *c and String[] x instead of String x[], mostly because I feel it reads better: it states what properties my variable is going to have and then what the name is, without mixing the two together.

lazydog · May 14, 2007

Yup, I prefer char* c too but it can lure you into a false sense of security:-

char* c, d ;

is not the same as

char* c ;
char* d ;

You need to write:-

char* c, *d ;

which is a bit ugly in my opinion!
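If you want the compiler to confirm it, the sketch below checks the declared types at compile time. It needs C++11 or later for decltype and static_assert, and the variable names are only for illustration.

```cpp
#include <type_traits>

char* c, d;   // the star binds to c only: c is char*, d is a plain char

static_assert(std::is_same<decltype(c), char*>::value,
              "c is a pointer to char");
static_assert(std::is_same<decltype(d), char>::value,
              "d is a plain char, not a pointer");

char *e, *f;  // one star per name makes both of them pointers

static_assert(std::is_same<decltype(e), char*>::value &&
              std::is_same<decltype(f), char*>::value,
              "e and f are both pointers to char");
```

All three assertions hold, so the file compiles cleanly.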

b e n

toddburch · May 14, 2007

I've changed my code (not the above snippet) to use pointers. Works great!

Now that my dumping is working, now I'll create a mechanism to define my record layout, parse the data, do data validation and data type conversions. WOO-HOO!

C++ classes, here I come!

SilentPanda · May 14, 2007

lazydog said:
Yup, I prefer char* c too but it can lure you into a false sense of security:-

char* c, d ;

is not the same as

char* c ;
char* d ;

You need to write:-

char* c, *d ;

which is a bit ugly in my opinion!

b e n

Good point! I usually define my variables one at a time though... tends to save me time in the long run. Good point though for those that do.

fimac · May 14, 2007

lazydog said:
Yup, I prefer char* c too but it can lure you into a false sense of security:-

char* c, d ;

is not the same as

char* c ;
char* d ;

You need to write:-

char* c, *d ;

which is a bit ugly in my opinion!

This has been a recurring subject during my career -- and I have never decided upon a definitive answer. That said, mostly this week I have been using the "char* c" style, because the star is clearly part of the data-type. Which is nice.

Code:

typedef char* char_ptr_t;

As you noted, this rule breaks when multiple variables are declared at once. Personally, I find declaring one variable per line to be more maintainable, but YMMV.
