Scala parser performance with v2.10.x
I have been tinkering with rewriting csscss in Scala and have really enjoyed the language, especially after working in Haskell. I noticed, however, that after upgrading my local Java version to the latest 1.7 release (currently 1.7 update 45), my parsing performance tanked.
My toy parser is regex-based, so I asked on the mailing list whether there had been any recent regex changes. I was pointed to the following links:
It turns out that in 1.7 update 6 the semantics of String.substring changed. In older versions, String.substring created a String which shared an internal char[] value with the original String, which allowed you:
- To save some memory by sharing character data
- To run String.substring in constant time (O(1))
At the same time, this feature was a source of a possible memory leak…
The problem comes up when parsing in Scala because subSequence (which, for a String, depends on substring) is used heavily. Every parsing step therefore requires at least an O(n) copy of the character array into a new array when creating the String.
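To make the cost concrete, here is a rough sketch of the access pattern a regex-based parser ends up with: match a regex against the remaining input, advance, repeat. The names SubSequenceCost and tokens are hypothetical, and this is not the parser library's actual code, only an illustration of the same pattern.

```scala
import scala.util.matching.Regex

object SubSequenceCost {
  // Split `source` into tokens by repeatedly matching `token` at the
  // current offset. Each step asks for the *rest* of the input via
  // subSequence. When `source` is a plain String on Java 1.7u6 or later,
  // that call copies the remaining characters, so a run over n characters
  // can do on the order of n^2 character copies in total.
  def tokens(source: CharSequence, token: Regex): List[String] = {
    @annotation.tailrec
    def loop(offset: Int, acc: List[String]): List[String] =
      if (offset >= source.length) acc.reverse
      else token.findPrefixOf(source.subSequence(offset, source.length)) match {
        case Some(m) if m.nonEmpty => loop(offset + m.length, m :: acc)
        case _                     => acc.reverse // no match: stop here
      }
    loop(0, Nil)
  }
}
```

Because tokens only relies on the CharSequence interface, passing it something with a cheaper subSequence changes the cost without touching the loop.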
In the Scala bug report, the reporter mentioned writing a custom CharSequence that behaves the way String used to, in order to reclaim the performance.
Here is a snippet that did the trick for me:
Instead of parsing a String, you can wrap this around the String and parse that to reclaim performance.
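A minimal sketch of what such a wrapper could look like follows. The class name SharedCharSequence and its field names are hypothetical; the idea is simply that subSequence returns a new view over the same String instead of copying characters.

```scala
// Hypothetical wrapper: a read-only view over source covering [offset, limit).
final class SharedCharSequence(source: String, offset: Int, limit: Int)
    extends CharSequence {

  require(0 <= offset && offset <= limit && limit <= source.length,
    "invalid range")

  // Convenience constructor: view over the whole String.
  def this(source: String) = this(source, 0, source.length)

  def length(): Int = limit - offset

  def charAt(index: Int): Char = {
    if (index < 0 || index >= length())
      throw new IndexOutOfBoundsException(index.toString)
    source.charAt(offset + index)
  }

  // O(1): no characters are copied, only the bounds change.
  def subSequence(from: Int, until: Int): CharSequence = {
    if (from < 0 || until > length() || from > until)
      throw new IndexOutOfBoundsException(from.toString + ", " + until.toString)
    new SharedCharSequence(source, offset + from, offset + until)
  }

  // Copies only when a real String is actually needed.
  override def toString: String = source.substring(offset, limit)
}
```

Assuming your parser extends RegexParsers, whose parse and parseAll methods accept a CharSequence, usage would be along the lines of parseAll(myParser, new SharedCharSequence(css)), where myParser and css stand in for your own parser and input. Note that this brings back the old trade-off: every view keeps the whole original String reachable.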
This strategy may be brought into Scala v2.11, which should be released soon. Until then, you can use something similar.