Apr

01

Java vs. .NET Substring(..) tip

Here are a few notes on how the String type is stored, and how Substring(..) behaves differently, in .NET and Java. I ran into this recently.



Short background:

  I recently redesigned a function that transforms (potentially large) strings. It needed to be optimized for memory and time, especially for large source strings. I won't go into much detail: the new code is a kind of char scanner that writes its result into a StringBuilder. At one stage of the scan I need to check whether one of several predefined regular expressions matches right here, starting at the current char. So all the regular expressions were rewritten in this form:

  "^(regular_expr)" (^ means the start of the source string in most regular expression syntaxes)

 - so that a match can only begin at the starting char. Otherwise the regex engine keeps searching for a match to the end of the string, and when that happens often on a large string it is a very, very long process :-). It was an unpleasant surprise for me that when I call Regex.Match(string input, int startat) with a start position inside the string, that starting char is NOT treated as the FIRST one, i.e. the match fails for every start position > 0 (!). Here is a small example, just to show what I mean:

.....................
Regex reg_exp = new Regex("^12");
string str_source = "0123456789";

bool bRes = reg_exp.IsMatch(str_source, 1); // bRes will be FALSE !!
..........................
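
For comparison, here is a minimal Java sketch of the same "match only at this position" check done without any substring. It assumes the standard java.util.regex API: Matcher.region(..) restricts matching to a range, and with the default anchoring bounds ^ matches at the start of that range. The class name AnchoredMatchDemo and the helper matchesAt are made up for illustration. (On the .NET side, the documentation describes \G as the anchor that is satisfied at startat, while ^ stays tied to index 0; that variant is not tested in this post.)

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchoredMatchDemo {
    // Hypothetical helper: true if 'pattern' matches starting exactly at index 'start'.
    static boolean matchesAt(Pattern pattern, CharSequence source, int start) {
        Matcher m = pattern.matcher(source);
        // Limit matching to [start, length); default anchoring bounds make '^' match at 'start'.
        m.region(start, source.length());
        // lookingAt() only tries a match beginning at the region start; it never scans ahead.
        return m.lookingAt();
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("^12");
        String src = "0123456789";
        System.out.println(matchesAt(p, src, 1)); // true
        System.out.println(matchesAt(p, src, 0)); // false
    }
}
```

No characters are copied here: the Matcher works on the original string and only the region bounds change.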

So, to get the check I needed working quickly, I just called String.Substring(pos) and passed the substring to the regular expression match :-) => and OOPS :-( on a large string the function fell into an almost endless process, constantly busy with memory :-(.



That is how I found an interesting difference in how Strings are stored in Java and .NET (even though both are immutable in their languages, i.e. once created they cannot be changed):

- JAVA: a String can be stored as just a pointer to a sub-sequence of another String object's char array (an offset plus a length). So a call to String.substring(pos) does NOT allocate a new char array for the substring and does NOT copy the content. (This holds for JDK versions before Java 7 update 6; newer JDKs copy the characters in substring as well.)

- .NET: a String is stored only as its own char sequence, and each distinct String has separate storage. So a call to String.Substring(pos) DOES allocate a new char sequence for the substring and copies the content into it.
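
To make the two storage models concrete, here is a small Java sketch. It is only an illustration under my own assumptions, not the real JDK or .NET String class: the names SharedSliceDemo, Slice, sharingSubstring and copyingSubstring are all hypothetical.

```java
import java.util.Arrays;

class SharedSliceDemo {
    // Hypothetical immutable string-like type: a backing array plus offset and count.
    static final class Slice {
        final char[] value; // backing array, never modified after construction
        final int offset;   // index of the first char of this slice in 'value'
        final int count;    // number of chars in this slice

        Slice(char[] value, int offset, int count) {
            this.value = value;
            this.offset = offset;
            this.count = count;
        }

        // "Old Java" model: O(1), reuses the backing array and only shifts offset/count.
        Slice sharingSubstring(int begin) {
            return new Slice(value, offset + begin, count - begin);
        }

        // ".NET" model: O(n), allocates a new array and copies the tail into it.
        Slice copyingSubstring(int begin) {
            char[] copy = Arrays.copyOfRange(value, offset + begin, offset + count);
            return new Slice(copy, 0, copy.length);
        }

        char charAt(int index) {
            return value[offset + index];
        }

        int length() {
            return count;
        }
    }

    public static void main(String[] args) {
        Slice s = new Slice("0123456789".toCharArray(), 0, 10);
        System.out.println(s.sharingSubstring(3).value == s.value); // true: same array
        System.out.println(s.copyingSubstring(3).value == s.value); // false: new array
    }
}
```

Both variants return the same characters; the only difference is whether a new array is allocated and filled on every call.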

So, when the substring is needed ONLY for analyzing content (because some functionality is supported only for a whole string and not for a part of it, as in my case) => in .NET you need some other way to solve the problem, because many Substring(..) calls mean many char array allocations on the heap (which then have to be collected by the garbage collector etc.).
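
One "other way" is to hand the analyzing code a read-only view of the tail instead of a copy. A minimal sketch of that idea in Java, assuming the standard java.nio.CharBuffer.wrap(CharSequence, int, int), which wraps the original sequence without copying its chars. The class ZeroCopyViewDemo and the helpers tailView and matchesAt are hypothetical names, and this says nothing about which API is the right one on the .NET side.

```java
import java.nio.CharBuffer;
import java.util.regex.Pattern;

class ZeroCopyViewDemo {
    // Hypothetical helper: read-only view of source[pos..end); no chars are copied.
    static CharSequence tailView(String source, int pos) {
        return CharBuffer.wrap(source, pos, source.length());
    }

    // Hypothetical helper: does 'pattern' match at the very start of the tail view?
    static boolean matchesAt(Pattern pattern, String source, int pos) {
        // Pattern.matcher accepts any CharSequence, so the view can be passed directly.
        return pattern.matcher(tailView(source, pos)).lookingAt();
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("^12");
        String src = "0123456789";
        System.out.println(tailView(src, 1).charAt(0)); // 1
        System.out.println(matchesAt(p, src, 1));       // true
    }
}
```

The view costs one small object per call regardless of how long the tail is, so it stays cheap even when called for every char position.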



Here are 2 small pieces of code that demonstrate this difference:

JAVA:

 
public static void main(String[] args)
{
    // Allocate a large String (10 million chars, varied content)
    char[] arr_test = new char[10000000];
    for (int pos = 0; pos < arr_test.length; pos++)
        arr_test[pos] = (char) (pos % 128);
    String str_test = new String(arr_test);

    String substr = null;
    int lcount = 0;

    // Now iterate through the char positions of the large string
    for (int pos = 0; pos < str_test.length(); pos++)
    {
        // For each CHAR position take the SUBSTRING starting there - i.e. 10 million substrings :-)
        substr = str_test.substring(pos);

        // Do something with the substring so the optimizer cannot drop the substring call
        if (pos % 2 == 0)
            lcount += (int) substr.charAt(0);
        else
            lcount -= (int) substr.charAt(0);
    }

    // Just print a dummy result
    System.out.print("Result: ");
    System.out.print(lcount);
}
 

 - as you see, substring(..) is called 10 million times on a large string (with a small operation performed on each substring). This code runs in 1-2 seconds on my (not very fast) laptop. And it does not consume more memory than the source (10 million char) string needs: according to the visualvm JDK tool, each substring only allocates a new pair of int values.

 


.NET (the same code - just adapted to C#):

public static void Main(string[] args)
{
    // Allocate a large String (10 million chars, varied content)
    char[] arr_test = new char[10000000];
    for (int pos = 0; pos < arr_test.Length; pos++)
        arr_test[pos] = (char)(pos % 128);
    string str_test = new string(arr_test);

    string substr = null;
    int lcount = 0;

    // Now iterate through the char positions of the large string
    for (int pos = 0; pos < str_test.Length; pos++)
    {
        // For each CHAR position take the SUBSTRING starting there - i.e. 10 million substrings :-)
        substr = str_test.Substring(pos);

        // Do something with the substring so the optimizer cannot drop the Substring call
        if (pos % 2 == 0)
            lcount += (int)substr[0];
        else
            lcount -= (int)substr[0];
    }

    // Just print a dummy result
    System.Diagnostics.Debug.Write("Result: ");
    System.Diagnostics.Debug.WriteLine(lcount);
}
 

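- the same code as in Java, but it does not finish even in 2 hours :-) (it allocates memory and copies data the whole time, so the processor was quite busy :-(). Each Substring(pos) call copies the remaining Length - pos chars, so the whole loop copies about 5 x 10^13 chars in total.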
