Node gone when tags and info fields are present in the select list. #91

Closed
angelcervera opened this issue Oct 6, 2021 · 1 comment
Assignees: angelcervera
Labels: bug, spark (Spark connector)

angelcervera commented Oct 6, 2021

Posted in Gitter:
Hi. I encountered a very weird problem when using osm4scala which I cannot really explain :-(.
I have a PBF which contains a node with id 5103977631:

# osmium getid input.osm.pbf n5103977631 -o /tmp/output.osm.pbf
[======================================================================] 100% 
# osmium cat /tmp/output.osm.pbf -f osm
<?xml version='1.0' encoding='UTF-8'?>
<osm version="0.6" generator="osmium/1.13.1">
  <bounds minlat="-90" minlon="-180" maxlat="90" maxlon="180"/>
  <node id="5103977631" version="1" timestamp="2017-09-13T19:57:39Z" uid="74746" changeset="52018502" lat="26.1914693" lon="-81.689915"/>
</osm>
When reading the same input PBF with osm4scala, I can read that same node without any problem:
spark.read.format("osm.pbf")
  .load("/mnt/data/input.osm.pbf")
  .filter("type == 0")
  .select("id","type","latitude","longitude","nodes","relations","tags")
  .filter("id == 5103977631")
+----------+----+-----------------+------------------+-----+---------+----+
|        id|type|         latitude|         longitude|nodes|relations|tags|
+----------+----+-----------------+------------------+-----+---------+----+
|5103977631|   0|       26.1914693|        -81.689915|   []|       []|  {}|
+----------+----+-----------------+------------------+-----+---------+----+

However, when I add the column "info" in the select clause, I get this:

spark.read.format("osm.pbf")
  .load("/mnt/data/input.osm.pbf")
  .filter("type == 0")
  .select("id","type","latitude","longitude","nodes","relations","tags","info")
  .filter("id == 5103977631")
+---+----+--------+---------+-----+---------+----+----+
| id|type|latitude|longitude|nodes|relations|tags|info|
+---+----+--------+---------+-----+---------+----+----+
+---+----+--------+---------+-----+---------+----+----+
=> suddenly, the node can no longer be found?

So you would assume something is wrong with the "info" column, right? Let's try removing the "tags" column while keeping the "info" column:

spark.read.format("osm.pbf")
  .load("/mnt/data/input.osm.pbf")
  .filter("type == 0")
  .select("id","type","latitude","longitude","nodes","relations","info")
  .filter("id == 5103977631")
+----------+----+-----------------+------------------+-----+---------+--------------------+
|        id|type|         latitude|         longitude|nodes|relations|                info|
+----------+----+-----------------+------------------+-----+---------+--------------------+
|5103977631|   0|       26.1914693|        -81.689915|   []|       []|{1, 2017-09-13 19...|
+----------+----+-----------------+------------------+-----+---------+--------------------+

The node can be found again??? Huh! :-D

Some environment details:

  • Azure Databricks cluster 8.3
  • Spark 3.1.1
  • Scala 2.12
  • osm4scala com.acervera.osm4scala:osm4scala-spark3_2.12:1.0.8

Does anybody have an idea what I'm doing wrong?

angelcervera added the bug and spark (Spark connector) labels on Oct 6, 2021
angelcervera self-assigned this on Oct 6, 2021

angelcervera commented Oct 8, 2021

Bug reason:
There are 4 bytes before every header that indicate the length of the header.

If the start of a split falls between these bytes, the next header (and therefore the next blob as well) is ignored.

Usually, because there are not many blocks (17,126 in the planet file from Geofabrik), the chance of hitting this error is really low. But in PBF files with a lot of small blocks (say, millions), the chance of hitting the error is high.
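
For context, the PBF format stores every block on disk as a 4-byte big-endian integer (the size of the BlobHeader), followed by the BlobHeader itself, followed by the Blob it describes. Below is a minimal sketch of reading that length field with plain JDK I/O; it only illustrates the layout and the boundary problem, it is not the connector's actual reader, and readFirstBlockHeader is a made-up helper:

import java.io.{DataInputStream, FileInputStream}

// Layout of each block: [4-byte big-endian size of BlobHeader][BlobHeader][Blob].
// If a Spark split starts on byte 1, 2 or 3 of that size field, a reader that scans
// forward from the split start for the next header jumps past this length field,
// so that header and the blob it points to are silently skipped.
def readFirstBlockHeader(path: String): Array[Byte] = {
  val in = new DataInputStream(new FileInputStream(path))
  try {
    val headerSize = in.readInt()          // the 4 bytes that precede every header
    val headerBytes = new Array[Byte](headerSize)
    in.readFully(headerBytes)              // raw protobuf-encoded BlobHeader
    headerBytes                            // decoding it would give the Blob size
  } finally in.close()
}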

Steps to fix:

  • Reproduce the error in a unit test (a sketch of one possible approach follows this list)
  • Fix
  • Update documentation if necessary
  • Update change log
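
One way such a test could look at the DataFrame level, as a hedged sketch: it assumes the connector honours Spark's standard spark.sql.files.maxPartitionBytes setting and reuses the illustrative path from the report above; the real unit test may instead exercise the low-level blob reader with hand-picked split offsets.

// Read the same file twice: once with the default split size, once with very small
// splits so that many boundaries are produced. If any boundary falls inside a
// 4-byte header-length field, the buggy reader drops a whole blob and the counts differ.
val path = "/mnt/data/input.osm.pbf"                  // illustrative path

val defaultCount = spark.read.format("osm.pbf").load(path).count()

spark.conf.set("spark.sql.files.maxPartitionBytes", (64 * 1024).toString)  // tiny splits
val manySplitsCount = spark.read.format("osm.pbf").load(path).count()

assert(defaultCount == manySplitsCount,
  s"counts differ: $defaultCount (default splits) vs $manySplitsCount (small splits)")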
