MSDOTnet.org Forum Index
 

Only allowing alphanumeric characters and '_' and '-'

 
DotNetNewbie



Joined: 11 Dec 2007
Posts: 7

Posted: Tue Feb 26, 2008 6:59 pm    Post subject: Only allowing alphanumeric characters and '_' and '-'

Hi,

I want to parse a string, ONLY allowing alphanumeric characters and
also the underscore '_' and dash '-' characters.

Anything else in the string should be removed.

I think my regex is looking like:

^([\w\d_-])*$


Now if I have this code:

string username = "mrcsharpis_so_cool!!!";

How can I strip all the characters that I don't want?

Archived from group: microsoft>public>dotnet>languages>csharp
KH



Joined: 08 Aug 2007
Posts: 2

Posted: Tue Feb 26, 2008 7:42 pm    Post subject: RE: Only allowing alphanumeric characters and '_' and '-'

Regex is a bit overkill for that; you could...

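// Requires: using System.Text;
string str = "AB&*^#Cabc(#&123--__";

StringBuilder sb = new StringBuilder(str.Length);

foreach (char ch in str)
{
    // Keep letters, digits, '-' and '_'; drop everything else.
    // Note: Char.IsLetterOrDigit accepts any Unicode letter or digit, not just ASCII.
    if (Char.IsLetterOrDigit(ch) || ch == '-' || ch == '_')
    {
        sb.Append(ch);
    }
}

str = sb.ToString(); // "ABCabc123--__"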

Arne Vajhøj



Joined: 08 Aug 2007
Posts: 6

Posted: Tue Feb 26, 2008 11:30 pm    Post subject: Re: Only allowing alphanumeric characters and '_' and '-'

KH wrote:
> Regex is a bit overkill for that; you could...
>
> string str = "AB&*^#Cabc(#&123--__";
>
> StringBuilder sb = new StringBuilder(str.Length);
>
> foreach (char ch in str)
> {
> if (Char.IsLetterOrDigit(ch)
> || ch == '-' || ch == '_')
> {
> sb.Append(ch);
> }
> }
>
> str = sb.ToString();

I think that code is overkill compared to a simple Regex.Replace!

Arne
Jesse Houwing



Joined: 16 Aug 2007
Posts: 11

Posted: Wed Feb 27, 2008 4:32 am    Post subject: RE: Only allowing alphanumeric characters and '_' and '-'

Hello KH,

> Regex is a bit overkill for that; you could...
>
> string str = "AB&*^#Cabc(#&123--__";
>
> StringBuilder sb = new StringBuilder(str.Length);
>
> foreach (char ch in str)
> {
> if (Char.IsLetterOrDigit(ch)
> || ch == '-' || ch == '_')
> {
> sb.Append(ch);
> }
> }
> str = sb.ToString();

Though, should your requirements become more complex, a regex solution like
the following can be used:

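string cleaned = Regex.Replace("string to clean", @"[^\w\d_-]", "", RegexOptions.None);

Just put all the characters you want to keep into the negated character class
above (note the verbatim @"..." string, so that the C# compiler does not try
to interpret the \w and \d escapes itself). Everything else will be removed.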
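
As a minimal sketch of the Regex.Replace approach applied to the username from the original question (the names UsernameCleaner, CleanUnicode and CleanAscii are made up for illustration). One thing worth knowing: in .NET, \w by default matches Unicode letters and digits as well as the underscore, not just A-Z, a-z and 0-9. If "alphanumeric" is meant to be ASCII-only, the class has to be spelled out explicitly:

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class UsernameCleaner
{
    // Keeps everything .NET's \w matches (Unicode letters, digits, '_') plus '-'.
    public static string CleanUnicode(string input)
    {
        return Regex.Replace(input, @"[^\w-]", "");
    }

    // Keeps only ASCII letters, ASCII digits, '_' and '-'.
    public static string CleanAscii(string input)
    {
        return Regex.Replace(input, @"[^A-Za-z0-9_-]", "");
    }
}
```

UsernameCleaner.CleanAscii("mrcsharpis_so_cool!!!") returns "mrcsharpis_so_cool". The two methods only differ on non-ASCII input: for "Zo\u00EB_99!" the Unicode version keeps the "ë" and the ASCII version drops it.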

Jesse

--
Jesse Houwing
jesse.houwing at sogeti.nl
KH



Joined: 08 Aug 2007
Posts: 2

Posted: Tue Feb 26, 2008 10:09 pm    Post subject: Re: Only allowing alphanumeric characters and '_' and '-'

I usually avoid regexes because of performance. In this case I haven't tested,
but I would imagine the difference is approximately "who cares" ... nonetheless
I just think of regexes as overkill in many situations where people try to
use them.

A great way to use them, though, is to put the pattern in a config file so it
can be easily changed when requirements change, or for different customers, w/o
recompiling the app.
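
A rough sketch of that idea, assuming the pattern string has already been read from wherever the application keeps its settings (how it gets loaded is left out on purpose, and ConfigurableSanitizer and its members are hypothetical names, not framework types):

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper around a pattern that comes from configuration.
public sealed class ConfigurableSanitizer
{
    private readonly Regex disallowed;

    // disallowedPattern describes what to REMOVE, e.g. the string [^A-Za-z0-9_-]
    public ConfigurableSanitizer(string disallowedPattern)
    {
        if (disallowedPattern == null)
        {
            throw new ArgumentNullException("disallowedPattern");
        }

        // The Regex constructor throws ArgumentException for an invalid pattern,
        // so a broken config value fails here, when the sanitizer is constructed.
        this.disallowed = new Regex(disallowedPattern);
    }

    // Removes every match of the configured pattern from the input.
    public string Clean(string input)
    {
        return this.disallowed.Replace(input, string.Empty);
    }
}
```

With the pattern [^A-Za-z0-9_-] in the config, new ConfigurableSanitizer(pattern).Clean("mrcsharpis_so_cool!!!") gives "mrcsharpis_so_cool"; a customer with different rules only needs a different pattern string.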

Peter Duniho



Joined: 08 Aug 2007
Posts: 45

Posted: Tue Feb 26, 2008 10:39 pm    Post subject: Re: Only allowing alphanumeric characters and '_' and '-'

On Tue, 26 Feb 2008 17:09:00 -0800, KH wrote:

> "Arne Vajhøj" wrote:
>> I think that code is an overkill compared to a simple Regex.Replace !
>
> I usually avoid regex's because of performance. In this case I haven't
> tested but would imagine the difference is approximately "who cares" ...
> nonetheless I just think of regex's as overkill in many situations where
> people try to use them.

It's funny. I agree with both statements, sort of. (Do you smell an
essay coming on? You should... :) )

Fundamentally, I think that Regex is a good thing. It's a concise,
reliable way to represent various string interpretations and
manipulations. As far as performance goes, I don't think there's a
reliable way to say that Regex is always better- or worse-performing than
an equivalent explicit algorithm.

However, I do think that it's likely that Regex performs better for at
least a broad variety of possible applications, if not the majority. As a
framework class, it's got the potential to be well-optimized and there's
good justification for it to be. On the other hand, explicit algorithms
may or may not be well-optimized, depending on who wrote the code and how
often it's likely to be used.

In addition, every time you write an explicit algorithm, you risk writing
it wrong. With Regex, yes there's the possibility of writing an incorrect
expression, but it's more likely in that case that it just won't work.
It's much harder to get those subtle "happens once in a while with only
this very specific input" bugs. Not impossible, but IMHO more difficult.

So those are all things in favor of Regex. I think that in general,
anything that allows you to specify an operation in a concise, error-free
way and then perform that operation with reasonable, or even optimal
speed, that's a good thing.

But with Regex, the conciseness is IMHO a bit overboard. I recognize that
there are folks out there who have used regular expressions so much that
it's just like writing regular programming code to them. They know it
inside and out.

But for the rest of us, using Regex is an exercise in frustration as we
skip back and forth in the MSDN documentation trying to find just the
right syntax for representing some goal. There's an incredible amount of
capability there, and with that comes a fairly extensive grammar that
needs to be learned to use it effectively. But the syntax of that grammar
is pretty arcane IMHO, and has been very hard to learn, at least for me.

I wish we had something like Regex, but with a more natural-language-like
way to program it. Maybe something like a RegexBuilder class or something
that you can use to construct an appropriate regular expression. Or maybe
just a syntax that looks more like C# than like APL. Or maybe something
that takes actual C# code expressions and converts it into a suitable
regular expression. Or some alternative I've yet to consider.

I don't know what the actual solution is. All I know is that Regex itself
can be very trying to use if you're inexperienced with it, to a _much_
greater extent than, say, VB or C# might be. So in the end, for simple
operations I find myself thinking "well, some explicit C# code will be
clearer, and it should be easy to make it bug-free", and so I wind up not
using Regex there. And then for more complex operations, where the
conciseness and precision of Regex would be a benefit, I find myself
thinking "I just don't get how to do this in Regex and the docs aren't
helping me figure it out", and so I wind up not using Regex.

Which means that either way, I don't use Regex. I've posted questions
here asking how to write Regex expressions to do what I want, and to the
credit of the newsgroup experts who do know Regex, they've always come
through. For me, and for others who ask similar questions. Jesse Houwing
in particular deserves major kudos for his Regex "kung fu" and his
willingness to share it with others. But in the end, if I can't be
self-reliant on a technology, I tend not to use it.

Maybe if I had greater need to do string pattern matching, I'd take the
time and really learn regular expressions and then it'd be useful. But I
don't, and for the occasional moments when it'd be useful to me, it's just
not worth the time and effort to figure out that specific case.

I'd love to see someone fix that problem. :)

Pete
KWienhold



Joined: 11 Aug 2007
Posts: 1

Posted: Wed Feb 27, 2008 4:11 am    Post subject: Re: Only allowing alphanumeric characters and '_' and '-'

On 27 Feb., 02:39, "Peter Duniho" wrote:
> [full quote of the previous post snipped]

While I do use Regex from time to time (input field validation,
parsing SQL connection strings etc.), I totally agree with Peter.
Whenever I do use regular expressions, it would have been quite trivial
to achieve the same thing in code; and when the pattern matching becomes
complex enough to really make you want the power the Regex engine
offers, I often find I just can't get the expression to work right in
all circumstances.
A library that would offer a more natural way of constructing regular
expressions would be great, but given the complexity of the syntax
(let alone the fact that there are several different implementations),
I don't quite see how that could be done...

Kevin Wienhold
Stefan Nobis



Joined: 25 Feb 2008
Posts: 2

Posted: Wed Feb 27, 2008 3:32 pm    Post subject: Re: Only allowing alphanumeric characters and '_' and '-'

"Peter Duniho" writes:

> Fundamentally, I think that Regex is a good thing.

Fundamentally a RegEx is a type 3 grammar, equivalent to a finite
automaton. :)

So a RegEx is more like an upper bound to a class of pattern matching
problems. Sometimes a RegEx is not enough; then you need to go up in
the hierarchy to type 2 grammars and write parsers. But in many cases
you don't need all of the expressiveness of a RegEx, so you can use
much simpler constructs.

BTW: In the class of parsing problems where regular expressions
suffice, using a RegEx parser is the most costly (sane) way to do the
job. Simple comparisons like Char.IsLetterOrDigit (traversing the input
string only once, without the overhead of parser generation) are
always (much) faster and need (much) less memory.
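
For anyone who would rather measure than argue, here is a small timing sketch for the strip-the-username case. It is only an illustration: StripTiming and its members are hypothetical names, the loop uses an explicit ASCII test so that both versions do exactly the same job, and the results will depend on input, iteration count and runtime, so no numbers are quoted here.

```csharp
using System;
using System.Diagnostics;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical timing helper, for illustration only.
public static class StripTiming
{
    // Built once and reused, so construction cost is not paid on every call.
    private static readonly Regex Disallowed =
        new Regex(@"[^A-Za-z0-9_-]", RegexOptions.Compiled);

    public static string WithRegex(string input)
    {
        return Disallowed.Replace(input, string.Empty);
    }

    public static string WithLoop(string input)
    {
        StringBuilder sb = new StringBuilder(input.Length);
        foreach (char ch in input)
        {
            // Same character set as the regex above: ASCII letters, digits, '_' and '-'.
            bool keep = (ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z')
                     || (ch >= '0' && ch <= '9') || ch == '_' || ch == '-';
            if (keep)
            {
                sb.Append(ch);
            }
        }
        return sb.ToString();
    }

    // Times 'iterations' calls of 'strip' on 'input'.
    public static TimeSpan Measure(Func<string, string> strip, string input, int iterations)
    {
        strip(input); // warm-up call, so one-time JIT/regex compilation is not timed
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            strip(input);
        }
        sw.Stop();
        return sw.Elapsed;
    }
}
```

Calling StripTiming.Measure(StripTiming.WithLoop, input, 1000000) and then the same with StripTiming.WithRegex gives two TimeSpan values to compare for a given input.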

Some problems need full regular expression expressiveness, so in these
cases the cost and overhead of a RegEx is unavoidable.

> As far as performance goes, I don't think there's a reliable way to
> say that Regex is always better- or worse-performing than an
> equivalent explicit algorithm.

This class of problems is really well studied and understood. There
are quite reliable ways to say when a RegEx is needed, what performance
and memory characteristics follow, and when other ways are needed or more
efficient.

These and much more are the basics of computer science. There's more
to programming than just trial and error.

> other hand, explicit algorithms may or may not be well-optimized,

But a regular expression may also be badly written and as such induce
much more overhead and worse performance than a better written RegEx
run on the same regular expression engine. A regular
expression is a simple language but still complex enough to say the
same thing in different ways.

If you do a basic comparison of algorithms, you always have to assume
that the implementations are written as well as possible (for example, a
routine to copy a 10-character-long string should not need 50MB RAM
and quite some minutes of runtime to do its job; it's always possible
to do worse, we are only interested in whether it's possible to do better).

> In addition, every time you write an explicit algorithm, you risk
> writing it wrong. With Regex, yes there's the possibility of
> writing an incorrect expression, but it's more likely in that case
> that it just won't work. It's much harder to get those subtle
> "happens once in awhile with only this very specific input". Not
> impossible, but IMHO more difficult.

You haven't written many complex regular expressions, have you? With a
RegEx it is quite easy to get those subtle problems. But you are not
wrong. A regular expression is a type 3 grammar, C# has (more or less)
a type 2 grammar (it's even Turing complete), so it's much more
expressive and so there exists much more potential for errors.

> But for the rest of us, using Regex is an exercise in frustration as
> we skip back and forth in the MSDN documentation trying to find just
> the right syntax for representing some goal. There's an incredible
> amount of capability there, and with that comes a fairly extensive
> grammar that needs to be learned to use it effectively. But the
> syntax of that grammar is pretty arcane IMHO, and has been very hard
> to learn, at least for me.

The concept of regular expressions is not that difficult. The most
common representation in today's languages is purely artificial. Other
representations and syntaxes are possible and do exist; for the
language Common Lisp there exists a library called cl-ppcre implementing a
quite efficient regular expression engine (for some examples even
faster than the C engine) -- this engine understands the common
representations but also allows another syntax:

CL-USER> (ppcre::parse-string "^([\w\d_-])*$")
(:SEQUENCE :START-ANCHOR (:GREEDY-REPETITION 0 NIL (:REGISTER (:CHAR-CLASS #\w #\d #\_ #\-))) :END-ANCHOR)

It's quite a long representation and maybe to some eyes even worse, but
it shows that other ways to notate a RegEx are quite possible.

> questions here asking how to write Regex expressions to do what I
> want

Maybe have a look at

http://weitz.de/regex-coach/

an IMHO quite useful tool to learn regular expressions and to
experiment with them.

--
Stefan.
Stefan Nobis



Joined: 25 Feb 2008
Posts: 2

Posted: Wed Feb 27, 2008 3:51 pm    Post subject: Re: Only allowing alphanumeric characters and '_' and '-'

Stefan Nobis writes:

> CL-USER> (ppcre::parse-string "^([\w\d_-])*$")
> (:SEQUENCE :START-ANCHOR (:GREEDY-REPETITION 0 NIL (:REGISTER (:CHAR-CLASS #\w #\d #\_ #\-))) :END-ANCHOR)

Oops, bad example. The simple translator doesn't convert \w and
\d. Sorry. It should read more like this (to put everything except \w,
- and _ in the register):

(:SEQUENCE :START-ANCHOR
  (:GREEDY-REPETITION 0 NIL
    (:REGISTER
      (:INVERTED-CHAR-CLASS :WORD-CHAR-CLASS
                            #\_
                            #\-)))
  :END-ANCHOR)

The first two parameters to :GREEDY-REPETITION are the min and max
allowed number of repetitions (the above 0 NIL corresponds to the *,
something like (:GREEDY-REPETITION 3 5 ...) corresponds to
...{3,5}). The syntax #\_ is Common Lisp syntax for the single
character _.

Here is a handwritten example using the verbose syntax (I
don't have the perl-like version at hand, sorry):

(:sequence :start-anchor (:alternation #\# ";;;")
  (:positive-lookahead :word-char-class)
  (:register (:greedy-repetition 0 nil :word-char-class))
  (:positive-lookahead
    (:alternation :end-anchor
      (:sequence
        (:greedy-repetition 1 nil
          :whitespace-char-class)
        :non-whitespace-char-class)))
  (:greedy-repetition 0 1
    (:sequence
      (:greedy-repetition 1 nil :whitespace-char-class)
      (:register (:greedy-repetition 0 nil :everything)))))

--
Stefan.
Arne Vajhøj



Joined: 08 Aug 2007
Posts: 6

Posted: Thu Feb 28, 2008 2:53 am    Post subject: Re: Only allowing alphanumeric characters and '_' and '-'

KH wrote:
> I usually avoid regex's because of performance. In this case I haven't tested
> but would imagine the difference is approximately "who cares" ... nonetheless
> I just think of regex's as overkill in many situations where people try to
> use them.

Usually fewer lines of code is what is most cost effective overall.

Regex is simple code (and if the reader knows regex as a general concept
it is even easy to read) and code that is easy to modify to different
requirements.

It does come with a certain overhead. It may not be suited for
being called billions or trillions of times. But I doubt that was
the case here (the variable was named 'username').

Arne
