|View Issue Details|
|ID||Project||Category||View Status||Date Submitted||Last Update|
|083||Source Integration||Gitweb||public||2009-12-05 08:55||2009-12-05 14:24|
|Target Version||Fixed in Version|
|Summary||083: Gitweb fails to scrape with PHP 5.3.0 on Windows|
|Description||When I try to scrape a gitweb site (http://git.mantisforge.org/w/meta.git) using source-integration (commit a00ec98042f7044069f823787ab796c4522216bb) with PHP 5.3.0 (from Wampserver 2.0i) on a Windows box, the regexes fail to match.|
In particular, the following regexes fail to match:
* commit author + date
* commit message
* files touched by commit
The failure to extract author is fatal, as the database table defines 'author' as NOT NULL.
|Steps To Reproduce||Configure source-integration against http://git.mantisforge.org/w/meta.git with PHP 5.3.0 on Windows; do an import (of everything).|
|Additional Information||I've got patches for this coming into http://git.mantisforge.org/w/source-integration/rws.git imminently.|
The failures are mostly due to unexpected newlines (needing extra '\s*' patterns) or a missing regexp single-line ('s') qualifier.
I'm a little puzzled, however, why the extra \s* are needed on Windows but not on Unix boxes. (I'm marginally more understanding of the need for the 's' qualifier.)
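For illustration, here is a minimal sketch of what the single-line qualifier changes. It uses Python's `re` module, where `re.DOTALL` plays the role of the PCRE 's' modifier. The HTML fragment and the pattern are made up for the example and are not the ones in SourceGitweb.php.

```python
import re

# Hypothetical fragment: a commit message that spans several lines.
html = '<div class="page_body">\nFix scraping\nSecond line\n</div>'

# Hypothetical pattern, not the one used by SourceGitweb.php.
pattern = r'<div class="page_body">(.*?)</div>'

# Without DOTALL, '.' stops at a newline, so the match cannot reach </div>.
without_dotall = re.search(pattern, html)

# With DOTALL (the counterpart of the PCRE 's' modifier), '.' also matches newlines.
with_dotall = re.search(pattern, html, re.DOTALL)

print(without_dotall)
print(with_dotall.group(1).strip())
```

Any pattern that captures text with `.` across a line break will behave this way, which would be consistent with the commit message failing to match once newlines appear in the scraped HTML.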
|Tags||No tags attached.|
John Reese (administrator)
|Perhaps the issue is caused by Windows' propensity to use \r as part of a newline?|
Royston Shufflebotham (reporter)
Well, I'm sort of suspecting something along those lines, but for at least one of the issues it's more a case of extra whitespace rather than the wrong sort of whitespace, which is where I'm confused about the PHP 5.3 or Windows difference.
HTML emitted by gitweb on git.mantisforge.org:
... snip ...
<tr><td>author</td><td>John Reese <email@example.com></td></tr>
<tr><td></td><td> Tue, 20 Oct 2009 16:17:28 +0000 (12:17 -0400)</td></tr>
... snip ...
The original regex in SourceGitweb.php doesn't expect any whitespace between the closing </tr> on the first line and the opening <tr> on the second line. I don't get why Windows would treat that differently.
Still, adding a \s* does make it work (and is what I'd expect), but I'd still prefer to understand precisely what the difference is. (And actually, why it obviously works on some machines without that \s*!)
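The effect of the extra `\s*` can be sketched as follows, again with Python's `re` module rather than PHP. Both patterns are hypothetical stand-ins for the one in SourceGitweb.php, and the author e-mail is left out to keep them short. The sketch only shows which row separators each pattern tolerates; it does not explain why the separator would differ between machines.

```python
import re

row_author = '<tr><td>author</td><td>John Reese</td></tr>'
row_date = '<tr><td></td><td> Tue, 20 Oct 2009 16:17:28 +0000 (12:17 -0400)</td></tr>'

# Strict pattern: expects the two rows to be directly adjacent.
strict = re.compile(
    r'<tr><td>author</td><td>([^<]*)</td></tr>'
    r'<tr><td></td><td>([^<]*)</td></tr>'
)

# Tolerant pattern: allows any whitespace (including \r and \n) between the rows
# and before the date.
tolerant = re.compile(
    r'<tr><td>author</td><td>([^<]*)</td></tr>\s*'
    r'<tr><td></td><td>\s*([^<]*)</td></tr>'
)

# Try the rows with no separator, a Unix newline, and a Windows newline.
results = {}
for sep in ('', '\n', '\r\n'):
    html = row_author + sep + row_date
    results[sep] = (bool(strict.search(html)), bool(tolerant.search(html)))
    print(repr(sep), results[sep])

match = tolerant.search(row_author + '\r\n' + row_date)
print(match.group(1), '|', match.group(2))
```

The strict pattern only matches when the rows are adjacent, while the tolerant one matches with either kind of newline, since `\s` covers both `\r` and `\n`.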
|2009-12-05 08:55||Royston Shufflebotham||New Issue|
|2009-12-05 13:27||John Reese||Note Added: 074|
|2009-12-05 14:24||Royston Shufflebotham||Note Added: 075|