macOS Generating data with "regex" in Java

SilentPanda · Apr 14, 2010

I need a way for users to specify how data is generated. For instance, they might need a last name that is made up of alpha characters and is 4-16 characters long. Or maybe an internal number that has to match "X\d{4}400".

My thought was to write a reverse regex type thing. They could define the data with a regex and it would generate the data off it. Some aspects would not be supported, such as +, *, ?... mostly because they wouldn't make sense in the application. Primarily it would support (|), {x[,y]}, [], that kind of stuff. I know it's not regular expressions, but it's using that notation as a base.

Is this the wrong way to go about it? Honestly the users probably won't use anything super in depth as none of them know much about regex. However they do need to be somewhat specific in their data definitions and I felt this would be a way they could do that.

Just kinda doing a sanity check before I drive myself insane...

Heck there might already be an alternative library out there that does what I need... I just can't find one.

mags631 · Apr 14, 2010

Is the only input the string format? Or will last names, numeric ids, etc. be passed?
Is the function to generate the next literal?
Do you need to guarantee uniqueness?

My immediate thought was regex substitution... but it will depend on the requirements.

SilentPanda · Apr 14, 2010

The users will need to come up with the notation and then my class will generate a random sampling of data based on the notation. The data will be alpha, numeric, alphanumeric, and mixed (symbols). The data may need to be unique, for instance if they want 500 pieces of data generated.
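
The uniqueness requirement can be kept separate from the notation by wrapping whatever generator ends up being used. A minimal sketch in Python, assuming a hypothetical zero-argument callable `make_value` that stands in for the real generator; the names `generate_unique` and `max_attempts_per_value` are illustrative only. The attempt budget is there because a definition with too few possible values can never yield the requested number of unique results:

```python
import random


def generate_unique(make_value, count, max_attempts_per_value=100):
    """Collect `count` distinct values from make_value().

    make_value is a hypothetical zero-argument callable returning one random
    string. Raises ValueError if enough distinct values cannot be produced
    within the attempt budget.
    """
    seen = set()
    results = []
    attempts_left = count * max_attempts_per_value
    while len(results) < count:
        if attempts_left == 0:
            raise ValueError("could not generate %d unique values" % count)
        attempts_left -= 1
        value = make_value()
        if value not in seen:
            seen.add(value)
            results.append(value)
    return results


def nine_digits():
    # Stand-in generator: nine random digits.
    return "".join(random.choice("0123456789") for _ in range(9))


ids = generate_unique(nine_digits, 500)
```

Results come back in generation order, and a definition that only allows, say, two values fails with an error instead of looping forever when three unique values are requested.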

Right now they have their data definitions written in a Word document as "I need a last name, it's only alpha characters, and the min length is 5 and the max length is 14". Or, "I need a social security number, it's 9 numbers long". Or, "I need a unique internal ID which is 5 alpha characters followed by 3 numbers and ends with a Z".

So when I make thousands of records for them, it would be much easier for them to say:

Last Name - [A-Za-z]{5,14}
SSN - [0-9]{9}
Internal ID - [A-Z]{5}[0-9]{3}Z

and do that 100,000 times.
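
A minimal sketch of how those three definitions could be interpreted, written in Python and covering only what they use: character classes with ranges, single literal characters, and {n} or {min,max} counts. The names `TOKEN`, `expand_class`, and `generate` are made up for illustration, and escapes, grouping, and "OR" are deliberately left out:

```python
import random
import re

# One token: a character class like [A-Za-z] or a single literal character,
# optionally followed by a repeat count {n} or {min,max}.
TOKEN = re.compile(r"(\[[^\]]+\]|[^\[\]{}])(?:\{(\d+)(?:,(\d+))?\})?")


def expand_class(body):
    """Turn the inside of a class, e.g. 'A-Za-z', into a list of characters."""
    chars = []
    i = 0
    while i < len(body):
        if i + 2 < len(body) and body[i + 1] == "-":
            # A range such as A-Z: include both endpoints.
            chars.extend(chr(c) for c in range(ord(body[i]), ord(body[i + 2]) + 1))
            i += 3
        else:
            chars.append(body[i])
            i += 1
    return chars


def generate(pattern):
    """Generate one random string for the restricted notation."""
    out = []
    pos = 0
    while pos < len(pattern):
        match = TOKEN.match(pattern, pos)
        if match is None:
            raise ValueError("unsupported notation at position %d" % pos)
        atom, low, high = match.groups()
        chars = expand_class(atom[1:-1]) if atom.startswith("[") else [atom]
        low = int(low) if low is not None else 1
        high = int(high) if high is not None else low
        for _ in range(random.randint(low, high)):
            out.append(random.choice(chars))
        pos = match.end()
    return "".join(out)


definitions = {
    "Last Name": "[A-Za-z]{5,14}",
    "SSN": "[0-9]{9}",
    "Internal ID": "[A-Z]{5}[0-9]{3}Z",
}
record = {name: generate(pattern) for name, pattern in definitions.items()}
print(record)
```

Calling `generate` in a loop gives the thousands of records; the output differs on every run because each character is picked at random.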

lee1210 · Apr 14, 2010

This may be a little more difficult, but I'd be inclined to have them pick the specification from some series of dropdowns, etc. that show them something in English. For example:
Dropdown 1:
Alpha
Numerical
Symbol
Alphanumerical
Alpha and Symbol
Numerical and Symbol
Any
Specified Set

If specified set is chosen, have a text entry Field 1 where they can enter the characters.

Dropdown 2, minimum number of this character type

Dropdown 3, maximum number of this character type

Field 2, description of this data

They pick, and then choose "Add to specification"; you build up a full specification from any number of these individual character groupings. You can generate a description in English of what they've chosen, with a small (5ish) sample of what will be generated.

Once they're ready to submit, you can store this however you want in the background. If you really want to, you could display this to the user and allow a "shortcut entry" if they know the syntax you're using in the background.
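
A rough sketch of what the data behind such a form could look like, in Python. Everything here is an assumption for illustration: the `Group` record, the `CHARSETS` table (only three of the dropdown choices are filled in), and the wording produced by `describe`:

```python
import random
import string
from collections import namedtuple

# Character sets for a few of the Dropdown 1 choices; "Specified Set" would
# use the characters typed into Field 1 instead.
CHARSETS = {
    "Alpha": string.ascii_letters,
    "Numerical": string.digits,
    "Alphanumerical": string.ascii_letters + string.digits,
}

# One "Add to specification" entry: Dropdown 1, Dropdown 2, Dropdown 3, Field 2.
Group = namedtuple("Group", "kind minimum maximum description")


def describe(groups):
    """English summary of the whole specification, one line per group."""
    lines = []
    for g in groups:
        lines.append("%s: %d to %d %s characters"
                     % (g.description, g.minimum, g.maximum, g.kind.lower()))
    return "\n".join(lines)


def sample(groups):
    """Generate one value by concatenating a random run for each group."""
    parts = []
    for g in groups:
        chars = CHARSETS[g.kind]
        length = random.randint(g.minimum, g.maximum)
        parts.append("".join(random.choice(chars) for _ in range(length)))
    return "".join(parts)


spec = [
    Group("Alpha", 5, 5, "ID prefix"),
    Group("Numerical", 3, 3, "ID number"),
]
print(describe(spec))
print([sample(spec) for _ in range(5)])
```

The list of groups is the stored specification; the description and the five samples are what the user would see before submitting.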

I guess if your users are super-technical you could make them enter a seemingly random string of gibberish, but that seems pretty mean if they are not also programmers.

-Lee

SilentPanda · Apr 14, 2010

That's a good point. They'll primarily be putting the data notation into an Excel spreadsheet which my application will then interpret. But I could at least offer a UI for some of the easier and more common things they will be doing. It would then make the encoded string for them to paste into their Excel document. Most of their data is probably going to be [A-Z]{x,y} and [0-9]{x,y} anyway. The reason we're going a bit further is for those fields that do require a little bit more... oomph. There will probably be a few of these per spreadsheet.

mags631 · Apr 14, 2010

I don't think you should use regex (well maybe to parse the rules). The output function should decompose a literal into stems, with individual rules for stems. E.g., here is a simple Python version:

Code:

import random


class Stem:
    def __init__(self, constant_stem=None, valid_chars=None, min_length=0, max_length=0):
        '''constant_stem is a fixed string to emit as-is; otherwise
        valid_chars is a string of valid characters and the stem length is
        picked at random between min_length and max_length (inclusive)'''
        self.constant_stem = constant_stem
        self.valid_chars = valid_chars
        self.min_length = min_length
        self.max_length = max_length

    def generate(self):
        # if this is a constant stem then just return it as the stem
        if self.constant_stem is not None:
            return self.constant_stem
        # otherwise, generate it randomly
        length = random.randint(self.min_length, self.max_length)
        return u''.join(random.choice(self.valid_chars) for i in range(length))


class Literal:
    def __init__(self, stem_defs):
        # each stem_def is a tuple of Stem constructor arguments
        self.stems = [Stem(*stem_def) for stem_def in stem_defs]

    def generate(self):
        # concatenate one generated value from each stem, in order
        return u''.join(stem.generate() for stem in self.stems)

    def generateTimes(self, number):
        # build a list of independently generated literals
        return [self.generate() for i in range(number)]

And it generates:

Code:

>>> reload(RandomLiteral)
<module 'RandomLiteral' from 'RandomLiteral.py'>
>>> ssn_literal = RandomLiteral.Literal([
... (None, "0123456789", 3, 3),
... ("-",),
... (None, "0123456789", 2, 2),
... ("-",),
... (None, "0123456789", 4, 4)
... ])
>>> ssn_literal.generate()
u'972-59-7621'
>>> ssn_literal.generateTimes(100)
[u'383-48-5897', u'249-65-8404', u'709-43-4150',  ....]
>>>

mrbash · Apr 15, 2010

Panda: I don't believe there is an easy way for you to do this. The reverse process of pattern -> string is generally non-deterministic. A simple class like [.3*] can have any number of different strings that would be satisfactory.

I think you'll probably have to start off with some simplifying assumptions.

SilentPanda · Apr 20, 2010

Well I finished this up yesterday. It works pretty well for my purposes. It supports escaping certain characters with \, nested parenthesis grouping mostly for "OR" statements, "OR" statements with the |, character classes with the [], and explicit ranges with {}.
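
The actual Java class was not posted, so the following is only an independent sketch of one way to support that feature list (escapes, nested groups, "OR", character classes, and explicit ranges), written in Python as a small recursive-descent parser. All names are hypothetical, and treating \d as "one digit" is an assumption based on the "X\d{4}400" example from the first post:

```python
import random
import re

COUNT = re.compile(r"\{(\d+)(?:,(\d+))?\}")


class Parser:
    def __init__(self, text):
        self.text = text
        self.pos = 0

    def peek(self):
        return self.text[self.pos] if self.pos < len(self.text) else None

    def take(self):
        ch = self.peek()
        if ch is None:
            raise ValueError("unexpected end of pattern")
        self.pos += 1
        return ch

    def alternation(self):
        # sequence ('|' sequence)*: pick one branch at random per call.
        branches = [self.sequence()]
        while self.peek() == "|":
            self.take()
            branches.append(self.sequence())
        return lambda: random.choice(branches)()

    def sequence(self):
        # Zero or more (optionally repeated) atoms, concatenated in order.
        items = []
        while self.peek() not in (None, "|", ")"):
            items.append(self.repeated(self.atom()))
        return lambda: "".join(item() for item in items)

    def atom(self):
        ch = self.take()
        if ch == "(":
            inner = self.alternation()
            self.take()  # consume ')'; raises if the group is never closed
            return inner
        if ch == "[":
            chars = self.char_class()
            return lambda: random.choice(chars)
        if ch == "\\":
            escaped = self.take()
            if escaped == "d":  # assumption: \d means one digit
                return lambda: random.choice("0123456789")
            return lambda: escaped
        if ch in "+*?{}]":
            raise ValueError("unsupported %r at position %d" % (ch, self.pos - 1))
        return lambda: ch

    def char_class(self):
        # Body of [...]: single characters and ranges such as A-Z.
        chars = []
        while self.peek() != "]":
            first = self.take()
            if self.peek() == "-" and self.text[self.pos + 1:self.pos + 2] not in ("", "]"):
                self.take()
                last = self.take()
                chars.extend(chr(c) for c in range(ord(first), ord(last) + 1))
            else:
                chars.append(first)
        self.take()  # consume the closing ']'
        if not chars:
            raise ValueError("empty character class")
        return chars

    def repeated(self, gen):
        # Optional {n} or {min,max} after an atom.
        match = COUNT.match(self.text, self.pos)
        if match is None:
            return gen
        self.pos = match.end()
        low = int(match.group(1))
        high = int(match.group(2) or low)
        return lambda: "".join(gen() for _ in range(random.randint(low, high)))


def compile_pattern(text):
    """Parse the notation once; call the returned function to get a value."""
    parser = Parser(text)
    generate = parser.alternation()
    if parser.pos != len(text):
        raise ValueError("unmatched ')' at position %d" % parser.pos)
    return generate


make_id = compile_pattern(r"(AB|C)[0-9]{2}\+[x-z]{1,3}")
print([make_id() for _ in range(5)])
```

Unescaped +, *, and ? are rejected rather than silently treated as literals, which leaves room to give them a meaning later without changing how existing definitions behave.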

I coded things fairly close to "spec" when possible, even when it wasn't needed. For instance, I really had no need to escape the + operator, as it's not supported. But in the odd event that I do need to implement it in the future, it shouldn't be as much of a big deal...

Came out to about 300ish lines of code for the class and about 350 for all my JUnit tests... first time I've used JUnit, but I'm very happy with it. I had to overhaul something in the middle of coding and it was nice to be able to run my tests to ensure I hadn't broken anything!

lee1210 · Apr 20, 2010

Does this code belong to your employer? If not, can you post it for the benefit of others?

-Lee

SilentPanda · Apr 20, 2010

It does... I had thought about posting it but it's not "mine"... bleah. It's not terribly complex and I would actually be up for posting it otherwise for people to beat up on how inefficient it is and how they'd do it this other way instead...

Actually I like that kind of stuff as it lets you learn...

macsmurf · Apr 20, 2010

One way of doing it would be to translate the regexp to a finite automaton (directed graph) and then travel through it backwards from an accept state.

I don't know if that is what you have done.
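
As an illustration of the automaton idea, here is a small Python sketch that skips the regexp-to-automaton translation and works on a hand-built graph for (AB|C)[0-9]{2}. The edge list, state numbers, and function name are all made up for the example, and it assumes the graph is acyclic, which holds when + and * are not supported:

```python
import random

DIGITS = "0123456789"

# Hand-built automaton for (AB|C)[0-9]{2}; each edge is
# (source state, allowed characters, destination state).
EDGES = [
    (0, "A", 1),
    (1, "B", 2),
    (0, "C", 2),
    (2, DIGITS, 3),
    (3, DIGITS, 4),
]
START, ACCEPT = 0, 4


def generate_backwards(edges, start, accept):
    """Walk from the accept state back to the start state, building the
    string right to left. Assumes the graph is acyclic and every state is
    reachable from start, so the walk always ends at start."""
    # Index the edges by destination so they can be followed in reverse.
    incoming = {}
    for src, chars, dst in edges:
        incoming.setdefault(dst, []).append((src, chars))
    state = accept
    reversed_chars = []
    while state != start:
        src, chars = random.choice(incoming[state])
        reversed_chars.append(random.choice(chars))
        state = src
    return "".join(reversed(reversed_chars))


print([generate_backwards(EDGES, START, ACCEPT) for _ in range(5)])
```

Picking incoming edges uniformly does not make every possible string equally likely; that would need the edges weighted by how many strings pass through them.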
