Notice: this is a static mirror for historical purposes.

ID: 083
Project: Source Integration
Category: Gitweb
View Status: public
Date Submitted: 2009-12-05 08:55
Last Update: 2009-12-05 14:24
Reporter: Royston Shufflebotham
Assigned To:
Product Version:
Target Version:
Fixed in Version:
Summary: 083: Gitweb fails to scrape with PHP 5.3.0 on Windows
Description: When I try to scrape a gitweb site ( [^]) using source-integration (commit a00ec98042f7044069f823787ab796c4522216bb) with PHP 5.3.0 (from Wampserver 2.0i) on a Windows box, the regexes fail to match.

In particular, the following regexes fail to match:

* commit author + date
* commit message
* files touched by commit

The failure to extract author is fatal, as the database table defines 'author' as NOT NULL.
Steps To Reproduce: Configure source-integration against [^] with PHP 5.3.0 on Windows; do an import (of everything).

Additional Information: I've got patches for this coming into [^] imminently.

The failures are mostly to do with unexpected newlines (needing extra '\s*' patterns) or a regexp single-line ('s') qualifier.

I'm a little puzzled, however, why the extra \s* are needed on Windows but not on Unix boxes. (I'm marginally more understanding of the need for the 's' qualifier.)
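
To illustrate the single-line ('s') qualifier mentioned above, here is a minimal sketch. It uses Python's re module as a stand-in for PHP's preg_* functions, where re.DOTALL plays the role of the 's' modifier. The HTML fragment, the class name and the pattern are assumptions made up for the example, not the real gitweb markup or the pattern in SourceGitweb.php.

```python
import re

# Hypothetical fragment: a commit message body that spans several lines.
html = '<div class="page_body">\nFix scraping\non Windows\n</div>'

# Hypothetical pattern, not the one used by source-integration.
pattern = r'<div class="page_body">(.*?)</div>'

# Without DOTALL, "." does not match a newline, so the multi-line body fails.
no_flag = re.search(pattern, html)

# With DOTALL (the counterpart of the "s" modifier), "." also matches newlines.
with_flag = re.search(pattern, html, re.DOTALL)

message = with_flag.group(1).strip() if with_flag else None
print(no_flag, repr(message))
```

The point of the sketch is only that a "." based capture silently stops working as soon as the captured text contains a line break, which is consistent with the commit message regex being one of the failures.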
Tags: No tags attached.
Attached Files

- Relationships

-  Notes
(074)
John Reese (administrator)
2009-12-05 13:27

Perhaps the issue is caused by Windows' propensity to use \r as part of a newline?
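
One way to check this hypothesis would be to look at exactly which whitespace sits between table rows in the fetched page. The helper below is a hypothetical diagnostic sketched in Python (standing in for PHP); the function name and the sample input are assumptions for illustration, and it is not part of source-integration.

```python
import re

def whitespace_between_rows(html):
    """Return the distinct whitespace runs found between </tr> and <tr>."""
    return sorted(set(re.findall(r"</tr>(\s*)<tr>", html)))

# Made-up sample containing CRLF, LF and no separator at all.
sample = "<tr></tr>\r\n<tr></tr>\n<tr></tr><tr></tr>"

for run in whitespace_between_rows(sample):
    # repr() makes \r and \n visible instead of printing blank lines.
    print(repr(run))
```

Running something like this against the same page fetched on Windows and on a Unix box would show whether the two differ in the kind of whitespace, the amount of whitespace, or not at all.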
(075)
Royston Shufflebotham (reporter)
2009-12-05 14:24

Well, I suspect something along those lines, but for at least one of the issues the problem is extra whitespace rather than the wrong sort of whitespace, which is why the PHP 5.3 or Windows difference confuses me.

HTML emitted by gitweb on

... snip ...
<tr><td>author</td><td>John Reese <></td></tr>
<tr><td></td><td> Tue, 20 Oct 2009 16:17:28 +0000 (12:17 -0400)</td></tr>
... snip ...

The original regex in SourceGitweb.php doesn't expect any whitespace between the closing </tr> on the first line and the opening <tr> on the second line. I don't get why Windows would treat that differently.

Still, adding a \s* does make it work (and is what I'd expect), but I'd still prefer to understand precisely what the difference is. (And actually, why it obviously works on some machines without that \s*!)
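
The effect of the extra \s* can be sketched against the HTML quoted above. This is Python standing in for PHP's preg_match, and both patterns are hypothetical reconstructions for illustration; the actual expressions in SourceGitweb.php may differ.

```python
import re

# The two rows quoted above, separated by a single newline.
html = (
    "<tr><td>author</td><td>John Reese <></td></tr>\n"
    "<tr><td></td><td> Tue, 20 Oct 2009 16:17:28 +0000 (12:17 -0400)</td></tr>"
)

# Hypothetical pattern that expects the two rows to be directly adjacent.
adjacent = re.compile(
    r"<tr><td>author</td><td>(.*?)</td></tr>"
    r"<tr><td></td><td>\s*(.*?)</td></tr>"
)

# Same pattern with \s* allowed between </tr> and <tr>.
tolerant = re.compile(
    r"<tr><td>author</td><td>(.*?)</td></tr>\s*"
    r"<tr><td></td><td>\s*(.*?)</td></tr>"
)

def scrape(pattern, text):
    """Return (author, date) if the pattern matches, else None."""
    m = pattern.search(text)
    return (m.group(1).strip(), m.group(2).strip()) if m else None

print(scrape(adjacent, html))                    # rows separated by "\n"
print(scrape(tolerant, html))
print(scrape(adjacent, html.replace("\n", "")))  # rows with no separator
```

The adjacent pattern only succeeds when the page arrives with no whitespace between the rows, so one possible explanation for it working on some machines is that the HTML reaching the regex differs there. The sketch does not show why that would be the case.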

- Issue History
Date Modified    | Username             | Field           | Change
2009-12-05 08:55 | Royston Shufflebotham | New Issue       |
2009-12-05 13:27 | John Reese           | Note Added: 074 |
2009-12-05 14:24 | Royston Shufflebotham | Note Added: 075 |

Copyright © 2000 - 2012 MantisBT Group