Unicode URL sanitising

Published on Tue Mar 01 2022

Additional matching regexes for Unicode URL sanitising

Match Javanese Script Syllable

Matches any syllable written in the Javanese script, based on its Unicode code points.

Unicode username

Unicode username check that allows some non-alphanumeric characters.
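
The listed pattern itself is not reproduced here, so the following is only a minimal Python sketch of what such a check could look like. The rule set (3 to 30 characters, dot and hyphen as the extra allowed characters, first character a letter or digit) and the names `USERNAME_RE` and `is_valid_username` are assumptions made for illustration.

```python
import re

# Hypothetical rules: first character is a Unicode letter or digit,
# followed by 2 to 29 letters, digits, underscores, dots, or hyphens.
# In Python 3, \w is Unicode-aware for str patterns by default.
USERNAME_RE = re.compile(r"[^\W_][\w.\-]{2,29}")


def is_valid_username(name: str) -> bool:
    """Return True if the whole string satisfies the assumed rules."""
    return USERNAME_RE.fullmatch(name) is not None
```

Using `fullmatch` avoids having to write explicit `^` and `$` anchors and rejects trailing newlines.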

Streets with one or more names with Unicode characters in Python

Matches street names made up of one or more words that may contain Unicode characters, for use in Python.
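
As a rough illustration, a street name check along these lines might be sketched in Python as follows. The choice of separators (space, apostrophe, hyphen) and the names `STREET_RE` and `is_street_name` are assumptions, not part of the listed pattern.

```python
import re

# [^\W\d_] means "a word character that is neither a digit nor an
# underscore", which in Python 3 amounts to a Unicode letter.
# Assumed separators between words: one space, apostrophe, or hyphen.
STREET_RE = re.compile(r"[^\W\d_]+(?:[ '\-][^\W\d_]+)*")


def is_street_name(text: str) -> bool:
    """Return True if text consists only of letter-words and separators."""
    return STREET_RE.fullmatch(text) is not None
```

Combining marks are not treated as letters by `\w`, so input is assumed to be NFC-normalised before matching.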

Remove Special ASCII Characters from Unicode String

Using this pattern, you can remove special characters such as ♥♥♥♥ ▓▒, and any other characters that are not Unicode letters, from a Unicode string.
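
One possible way to do this kind of clean-up in Python is sketched below. It is an assumption-laden variant rather than the listed pattern: it keeps digits, underscores, and whitespace as well as letters, and the names `NON_WORD_RE` and `strip_symbols` are hypothetical.

```python
import re

# Matches anything that is neither a word character (Unicode letter,
# digit, underscore) nor whitespace: symbols, punctuation, box drawing.
NON_WORD_RE = re.compile(r"[^\w\s]")


def strip_symbols(text: str) -> str:
    """Remove symbol characters and collapse the leftover whitespace."""
    cleaned = NON_WORD_RE.sub("", text)
    return " ".join(cleaned.split())
```

Note that this sketch also strips ordinary punctuation such as `!` and `,`, which may or may not be desired.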


This pattern checks for three words separated by dots. It uses Unicode matching and is not yet compatible with JavaScript.
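
A minimal Python sketch of a three-word, dot-separated check is shown below. Python's built-in `re` module has no `\p{L}` class, so `[^\W\d_]` is used here as a stand-in for "Unicode letter"; the names `WORD`, `THREE_WORDS_RE`, and `is_three_words` are made up for this example.

```python
import re

# One word: one or more Unicode letters (no digits, no underscores).
WORD = r"[^\W\d_]+"

# Exactly three words joined by literal dots.
THREE_WORDS_RE = re.compile(rf"{WORD}\.{WORD}\.{WORD}")


def is_three_words(text: str) -> bool:
    """Return True for strings shaped like word.word.word."""
    return THREE_WORDS_RE.fullmatch(text) is not None
```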

Invalid Unicode characters in XML

This pattern matches all the Unicode characters that are not allowed in an XML document. It's based on the Wikipedia article "Valid characters in XML".
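
For illustration, the same idea can be sketched in Python by negating the `Char` production of XML 1.0 (tab, line feed, carriage return, U+0020 to U+D7FF, U+E000 to U+FFFD, and U+10000 to U+10FFFF). This assumes XML 1.0 rather than XML 1.1, and the names `INVALID_XML_RE` and `strip_invalid_xml` are hypothetical.

```python
import re

# Negated character class: anything outside the XML 1.0 Char ranges.
INVALID_XML_RE = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml(text: str) -> str:
    """Drop every character that XML 1.0 does not allow in a document."""
    return INVALID_XML_RE.sub("", text)
```

Removing the characters is only one option; replacing them with U+FFFD or rejecting the input outright may be preferable depending on the application.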

RFC 3987 compliant URL regex

This is a JavaScript port of the URL regex from http://stackoverflow.com/a/190405/384062 that includes a bug fix and some optimization. Mathias Bynens's Regenerate was used to convert Unicode escapes. Bug fix: eliminated a stray | that falsely allowed the query string to contain |. Optimization: merged alternated character classes in the query string and fragment identifier portions for better performance.

PHP file path with wrappers

Breaks a file path up into wrappers, root, and path components. Understands both Windows (DOS) and Unix-style paths. Wrappers and path components can be further processed in code. The path component should support any visible Unicode character but not things like VT, HT, or any other non-printing character. Most of the non-printable characters would also be allowed by file systems but are nearly impossible to enter.
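
The split described above could be approximated in Python as in the sketch below. This is not the listed pattern: the wrapper syntax (`scheme://`, possibly stacked), the root forms (a drive letter plus separator, or a single leading separator), and the names `PATH_RE` and `split_path` are all assumptions. Only ASCII control characters are rejected in the path component.

```python
import re
from typing import Optional, Tuple

PATH_RE = re.compile(
    # Zero or more stacked wrappers such as "php://" or "compress.zlib://".
    r"(?P<wrappers>(?:[A-Za-z][A-Za-z0-9+.\-]*://)*)"
    # Optional root: "C:\" or "C:/" on Windows, "/" or "\" otherwise.
    r"(?P<root>[A-Za-z]:[\\/]|[\\/])?"
    # Remaining path: anything except ASCII control characters.
    r"(?P<path>[^\x00-\x1f\x7f]*)"
)


def split_path(text: str) -> Optional[Tuple[str, str, str]]:
    """Return (wrappers, root, path), or None if the input is rejected."""
    match = PATH_RE.fullmatch(text)
    if match is None:
        return None
    return match.group("wrappers"), match.group("root") or "", match.group("path")
```

The wrappers string and the path can then be split further on `://` and on the path separators respectively.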