I wrote some code that analyzes a given HTML chunk to find all occurrences of a certain tag and retrieve the unique data inside each of those tags.
That means about 200 (maybe 1,000) occurrences per string, from a file of 3 million characters, and there are several such files.
So far, with a shorter file containing 32 occurrences, it already takes 10 seconds to find them all. With the bigger files, it takes 44 s to find only 32 of the occurrences, so I expect about 5 min to process a whole file.
There are obvious flaws in my code and it's kind of dirty, but I don't know any other way to go about it.
Most notably, I don't know how to retrieve more than one variable per regex test, and I don't know how to retrieve the Nth match of a regex variable when it has multiple matches.
As a result, my code looks like this:
- Every tick, if the string matches the regex (first test):
- - Set a dictionary entry to a RegexMatchAt() (second test)
- - Set the original string to a RegexReplace() (third test) to remove the match just gathered, so it isn't matched again.
Then it starts over at the next tick until done.
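
To make the loop above concrete, here is a rough Python sketch of the same logic. My actual code is not Python, and the tag name `item`, the pattern, and the function name are placeholders I made up for illustration. Each pass through the loop runs three regex operations over the remaining string, which is where I suspect the time goes.

```python
import re

# Hypothetical pattern: the real tag name is not "item".
PATTERN = re.compile(r"<item>(.*?)</item>", re.DOTALL)

def extract_by_repeated_replace(html):
    """Mimic the described loop: test, capture, then strip the match."""
    found = {}
    index = 0
    while PATTERN.search(html):                # first test: is there still a match?
        match = PATTERN.search(html)           # second test: fetch that match
        found[index] = match.group(1)          # store the captured data
        html = PATTERN.sub("", html, count=1)  # third test: remove it from the string
        index += 1
    return found
```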
38 million characters processed three thousand times does sound like a lot of processing.
The back of my mind is telling me I could gather all the data in a single test, but I just have no idea how.
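
For comparison, this is roughly what a single-pass version might look like in Python, as a minimal sketch only. The tag `item`, the attribute `id`, and the function names `extract_all` and `nth_match` are assumptions for illustration, and whether the regex functions in my tool offer an equivalent of `finditer` or capture groups is something I have not verified. The idea is that the regex engine walks the string once, yields each match in turn, and exposes several capture groups per match, so the string never needs to be rewritten.

```python
import re

# Hypothetical tag and attribute names, chosen only for this example.
PATTERN = re.compile(
    r'<item id="(?P<id>[^"]*)">(?P<text>.*?)</item>',
    re.DOTALL,
)

def extract_all(html):
    """Collect every match in one left-to-right pass, without modifying html."""
    return [(m.group("id"), m.group("text")) for m in PATTERN.finditer(html)]

def nth_match(html, n):
    """Return the n-th match (0-based), or None if there are fewer matches."""
    for i, m in enumerate(PATTERN.finditer(html)):
        if i == n:
            return m.group("id"), m.group("text")
    return None

sample = '<item id="1">a</item><b>skip</b><item id="2">b</item>'
print(extract_all(sample))
print(nth_match(sample, 1))
```

Here each match carries two values at once (the two named groups), and the Nth match is found by counting matches during the same scan. For messy or nested HTML, a real HTML parser would likely be more robust than a regex.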