Apr 01
Posted by: Sergiy Danilchenko
Here are a few notes about how strings are stored, and how the Substring(..) function behaves differently, in .NET and Java. I got stuck on this recently.
Short backstory:
I recently redesigned a function that performs a transformation over (potentially large) strings. The function needed to be optimized for memory and time, especially for large source strings. I will not go into much detail => the result was a kind of char scanner that saves its output into a StringBuilder. At one stage of the scan I need to check whether any of a set of predefined regular expressions matches here, starting from the current char. Of course, all the regular expressions were rewritten to have this form:
"^(regular_expr)" (^ means the starting point of the source string in many regular expression syntaxes)
- to mark that the regular expression should match only at the starting char (otherwise the regular expression keeps searching for a match until the end of the string => and when that happens often in a large string, it becomes a very, very long process :-)). It was an unpleasant surprise for me that when I used the Regex.Match(string input, int startat) method with a starting position inside the string, that position was not treated as the FIRST one, i.e. the match failed for every starting position > 0 (!). For example, matching "^b" against "abc" with startat = 1 fails, even though the char at position 1 is 'b'.
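The same pitfall can be reproduced in Java, where Matcher.find(int) is the rough counterpart of Regex.Match(input, startat). The sketch below is only an illustration, not the original example: the class name AnchorDemo, the pattern "^b" and the input "abc" are assumptions. It also shows that restricting the matcher to a region makes ^ anchor at the region start, so the check needs no substring at all.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the pattern and inputs are made up for illustration.
final class AnchorDemo {
    private static final Pattern ANCHORED = Pattern.compile("^b");

    // Searches from 'pos' onward; '^' still refers to index 0 of the whole input.
    static boolean findFrom(String input, int pos) {
        return ANCHORED.matcher(input).find(pos);
    }

    // Restricts the matcher to [pos, end). With the default anchoring bounds,
    // '^' matches at the start of the region, and no substring is created.
    static boolean matchesAt(String input, int pos) {
        Matcher m = ANCHORED.matcher(input);
        m.region(pos, input.length());
        return m.lookingAt();
    }

    public static void main(String[] args) {
        System.out.println(findFrom("abc", 1));  // false: '^' is tied to index 0
        System.out.println(matchesAt("abc", 1)); // true: region starts at 'b'
        System.out.println(matchesAt("abc", 2)); // false: region starts at 'c'
    }
}
```

A scanner loop could reuse one Matcher and just move the region forward for every position, which avoids allocating anything per char.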
So, to get the check I needed working quickly, I just called String.Substring(pos) to hand a substring to the regular expression match :-) => and OOPS :-( - on a large string the function fell into an almost "endless" process that was constantly doing something with memory :-(.
That is how I discovered an interesting difference in how strings are stored in Java and .NET (even though String objects are immutable in both languages, i.e. once created they cannot be changed):
- Java: a String can be stored as just a reference to a sub-sequence of another String object's char array. So a call to String.substring(pos) does NOT allocate a new char array for the substring and does NOT copy the content into it (this is how JDK versions before 7u6 behave; later versions copy, as .NET does);
- .NET: a String is stored only as its own sequence of chars, and each distinct String has separate storage. So a call to String.Substring(pos) does allocate a new char sequence for the substring and copies the content into it.
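To make the two storage strategies concrete, here is a simplified model in Java. It is not the source of either runtime: CharSlice, shareFrom and copyFrom are hypothetical names invented for this sketch. A slice is a char array plus an offset and a count; the shared style reuses the array, while the copying style allocates and fills a new one.

```java
import java.util.Arrays;

// Hypothetical model of a string backed by (array, offset, count).
final class CharSlice implements CharSequence {
    private final char[] data; // storage, possibly shared with other slices
    private final int offset;  // index of the first char of this slice
    private final int count;   // number of chars in this slice

    CharSlice(String s) {
        this(s.toCharArray(), 0, s.length());
    }

    private CharSlice(char[] data, int offset, int count) {
        this.data = data;
        this.offset = offset;
        this.count = count;
    }

    // Shared-storage style: only a reference and two ints are new, O(1).
    CharSlice shareFrom(int pos) {
        if (pos < 0 || pos > count) throw new IndexOutOfBoundsException();
        return new CharSlice(data, offset + pos, count - pos);
    }

    // Copying style: a fresh array is allocated and filled, O(length).
    CharSlice copyFrom(int pos) {
        if (pos < 0 || pos > count) throw new IndexOutOfBoundsException();
        char[] copy = Arrays.copyOfRange(data, offset + pos, offset + count);
        return new CharSlice(copy, 0, copy.length);
    }

    boolean sharesStorageWith(CharSlice other) {
        return data == other.data;
    }

    @Override public int length() {
        return count;
    }

    @Override public char charAt(int i) {
        if (i < 0 || i >= count) throw new IndexOutOfBoundsException();
        return data[offset + i];
    }

    @Override public CharSequence subSequence(int start, int end) {
        if (start < 0 || end > count || start > end) throw new IndexOutOfBoundsException();
        return new CharSlice(data, offset + start, end - start);
    }

    @Override public String toString() {
        return new String(data, offset, count);
    }
}
```

Both shareFrom(6) and copyFrom(6) on "hello world" read as "world"; only the first one still points at the original array. The cost of the second grows with the length of the tail, which is what hurts when it is called once per char of a large string.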
So, when a substring is used ONLY for analyzing content (because some functionality is supported only for a whole string and not for part of one, as it was in my case) => in .NET you will need some other way to solve the problem, because calling Substring(..) many times causes many char array allocations on the heap (which the garbage collector then has to collect, etc.).
Two small test programs demonstrate this difference:
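The Java side of such a test might look roughly like the sketch below. It is a reconstruction under assumptions, not the original listing: the class name SubstringScan, the content of the test string and the per-substring work are all made up. The default size is scaled down to 100,000 chars, because on JDKs where substring(..) copies the data the full 10 million run behaves like the .NET case; pass a larger size as the first argument to try it.

```java
// Hypothetical benchmark sketch: one substring per position of a large string.
final class SubstringScan {

    // Builds a test string of n chars (the content is an arbitrary choice).
    static String buildSource(int n) {
        StringBuilder sb = new StringBuilder(n);
        for (int i = 0; i < n; i++) {
            sb.append((char) ('a' + i % 26));
        }
        return sb.toString();
    }

    // Takes the substring starting at every position and does a little work on it.
    static long scan(String source) {
        long checksum = 0;
        for (int pos = 0; pos < source.length(); pos++) {
            String tail = source.substring(pos);
            checksum += tail.charAt(0) + tail.length();
        }
        return checksum;
    }

    public static void main(String[] args) {
        int n = args.length > 0 ? Integer.parseInt(args[0]) : 100_000;
        String source = buildSource(n);
        long start = System.nanoTime();
        long checksum = scan(source);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("n=" + n + " checksum=" + checksum + " time=" + elapsedMs + " ms");
    }
}
```

With shared storage each iteration is constant work, so the loop is linear in the string length; with copying each iteration copies the whole tail, so the total work grows quadratically.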
- Java: substring(..) is computed 10 million times for a large string (and a few small operations are performed on each substring). This code runs in 1-2 seconds on my (not very fast) laptop. And it does not consume more memory than is needed for the source (10 million char) string: according to the VisualVM JDK tool, each substring only allocates a new pair of int values.
.NET (the same code, just adapted to C#):
- the same code as in Java, but it did not finish even after 2 hours :-) (it kept allocating memory and copying data, so the processor was quite busy :-().